Abstract

One of the motivations for property testing of boolean functions is the idea that testing can provide a fast 
preprocessing step before learning. However, in most machine learning applications, it is not possible to request
labels of fictitious examples constructed by the algorithm. Instead, the dominant query paradigm in applied
machine learning, called active learning, is one where the algorithm may query for labels, but only on points in a 
given polynomial-sized (unlabeled) sample, drawn from some underlying distribution D. In this work, we bring 
this well-studied model in learning to the domain of testing. 

We develop both general results for this active testing model as well as efficient testing algorithms for a 
number of important properties for learning, demonstrating that testing can still yield substantial benefits in this 
restricted setting. For example, we show that testing unions of d intervals can be done with $O(1)$ label requests
in our setting, whereas it is known to require $\Omega(d)$ labeled examples for learning (and $\Omega(\sqrt{d})$ for passive testing
[41], where the algorithm must pay for every example drawn from D). In fact, our results for testing unions of
intervals also yield improvements on prior work in both the classic query model (where any point in the domain
can be queried) and the passive testing model as well. For the problem of testing linear separators in $\mathbb{R}^n$ over the
Gaussian distribution, we show that both active and passive testing can be done with $O(\sqrt{n})$ queries, substantially
less than the $\Omega(n)$ needed for learning, with near-matching lower bounds. We also present a general combination
result in this model for building testable properties out of others, which we then use to provide testers for a number 
of assumptions used in semi-supervised learning. 

In addition to the above results, we also develop a general notion of the testing dimension of a given property 
with respect to a given distribution, that we show characterizes (up to constant factors) the intrinsic number of 
label requests needed to test that property. We develop such notions for both the active and passive testing models. 
We then use these dimensions to prove a number of lower bounds, including for linear separators and the class of 
dictator functions. 

Our results show that testing can be a powerful tool in realistic models for learning, and further that active 
testing exhibits an interesting and rich structure. Our work in addition brings together tools from a range of areas 
including U-statistics, noise-sensitivity, self-correction, and spectral analysis of random matrices, and develops 
new tools that may be of independent interest. 
1 Introduction



Property testing and machine learning have many natural connections. In property testing, given black-box access
to an unknown boolean function $f$, one would like with few queries to distinguish the case that $f$ has some given
property $\mathcal{P}$ (belongs to the class of functions $\mathcal{P}$) from the case that $f$ is far from any function having that property.
In machine learning one would like to find a good approximation $g$ of $f$, typically under the assumption that $f$
belongs to a given class $\mathcal{P}$. This connection is in fact a natural motivation for property testing: to cheaply determine
whether learning with a given hypothesis class is worthwhile [32, 56]. If the labeling of examples is expensive, or if
a learning algorithm is computationally expensive to run, or if one is deciding from what source to purchase one's
data, performing a cheap test in advance could yield substantial savings. Indeed, query-efficient testers have been
designed for many common function classes considered in machine learning, including linear threshold functions [49],
juntas [28], DNF formulas [25], and decision trees [25]. (See Ron's survey [56] for much more on the connection
between learning and property testing.)

However, there is a disconnect between the most commonly used property-testing and machine learning models. 
Most property-testing algorithms rely on the ability to query functions on arbitrary points of their choosing. On 
the other hand, most machine learning problems unfortunately do not allow one to perform queries on fictitious 
examples constructed by an algorithm. Consider, for instance, a typical problem such as machine learning for 
medical diagnosis. Given a large database of patients with each patient described by various features (height, age, 
family history, smoker or not, etc.), one would like to learn a function that predicts from these features whether or 
not a patient has a given medical condition (diabetes, for example). To perform this learning task, the researchers can 
run a (typically expensive) medical test on any of the patients to determine if the patient has the medical condition. 
However, researchers cannot ask whether the patient would still have the disease were the values of some of his
features changed! Moreover, researchers cannot make up a feature vector out of whole cloth and ask if that feature
vector has the disease. As another example, in classifying documents by topic, selecting an existing document 
on the web and asking a labeler "Is this about sports or business?" may be perfectly reasonable. However, the 
typical representation of a document in a machine learning system is as a vector of word-counts in $\mathbb{R}^n$ (a "bag of
words", without any information about the order in which they appear in the document). Thus, modifying some 
existing vector, or creating a new one from scratch, would not produce an object that we could expect a human 
labeler to easily classify. The key issue is that for most problems in machine learning, the example and the label
are in fact both functions of some underlying more complex object. Even in cases such as image classification — 
e.g., classifying handwritten digits into the numerals they represent — where a human labeler would be examining 
the same representation as the computer, queries can be problematic because the space of reasonable images is a 
very sparse subset of the entire domain. Indeed, now-classic experiments on membership-query learning algorithms 
for digit recognition ran into exactly this problem, leading to poor results. In this case, the problem is that the
distribution one cares about (the distribution of natural handwritten digits) is not one that the algorithm can easily 
construct new examples from. 

As a result of these issues, the dominant query paradigm in machine learning in recent years is not one where 
the algorithm can make arbitrary queries, but instead is a weaker model known as active learning [58, 61, 19,
9, 36, 11, 45]. In active learning, there is an underlying distribution D over unlabeled examples (say the
distribution of documents on the web, represented as vectors over word-counts) that we assume can be sampled 
from cheaply: we assume the algorithm may obtain a polynomial number of samples from D. Then, the algorithm 
may ask an oracle for labels (these oracle calls are viewed as expensive), but only on points in its sample. The goal of 
the active learning algorithm is to produce an accurate hypothesis while requesting as few labels as possible, ideally 
substantially fewer than in passive learning where every example drawn from D is labeled by the oracle. 

In this work, we bridge this gap between testing and learning by introducing, analyzing, and developing efficient 
algorithms for a model of testing that parallels active learning, which we call active testing. As in active learning, 
we assume that our algorithm is given a polynomial number of unlabeled examples from the underlying distribution 
D and can then make label queries, but only over the points in its sample. From a small number of such queries, 
the algorithm must then answer whether the function has the given property, or is far, with respect to D, from any 
function having that property (see Section 2 for formal definitions). We show that even with this restriction, we can
still efficiently test important properties for machine learning including unions of intervals, linear separators, and 
a number of properties considered in semi-supervised learning. Moreover, these testers reveal important structural 
characteristics of these classes. We additionally develop a notion of testing dimension that characterizes the number 
of examples needed to test a given property with respect to a given distribution, much like notions of dimension 
in machine learning. We do this for both the active testing model and the weaker passive testing model [32, 41]
in which only random sampling of a small number of points from the distribution is allowed. In fact, as part of our
analysis, we also develop improved algorithms for several important classes for the passive testing model as well. 
Overall, our results demonstrate that active testing exhibits an interesting and rich structure and strengthens the 
connection between testing and learning. 

1.1 Our Results 

We show that for a number of important properties for learning — including unions of intervals, linear threshold 
functions, and various assumptions used in semi-supervised learning — one can test in the active testing model with 
substantially fewer labels than needed to learn. We in addition consider the even more stringent passive testing 
model introduced by Goldreich, Goldwasser, and Ron [32] (where the only operation available to the algorithm is to
draw a random labeled sample from D) and give new positive results for that model as well. We further show that
for both active and passive testing models, we can characterize (up to constant factors) the intrinsic number of label
requests needed to test any given property $\mathcal{P}$ with respect to any given distribution D in terms of a new quantity we call the
testing dimension of $\mathcal{P}$ with respect to D. We then use these dimension notions to prove several near-tight lower
bounds. We expand on each of these points below. 

Unions of intervals. The function $f : [0,1] \to \{0,1\}$ is a union of d intervals if the set $f^{-1}(1)$ consists of at most
d intervals in [0, 1]. It is known that $\Theta(d)$ queries are necessary and sufficient for learning functions from this class.
Kearns and Ron [41] showed that under the uniform distribution, the relaxed problem of distinguishing unions of d
intervals from functions that are $\epsilon$-far from unions of $d/\epsilon$ intervals can be done with a constant number of queries
in the standard arbitrary-query testing model, and with $O(\sqrt{d})$ samples in the passive testing model. However, prior
to the current work, no non-trivial upper bound was known for the problem of distinguishing unions of d intervals
from functions $\epsilon$-far from unions of d intervals (as opposed to far from $d/\epsilon$ intervals).

We give an algorithm that tests unions of d intervals with only $O(1)$ queries in the active testing model. This
result holds over any underlying distribution (known or unknown). Moreover, in the case that the underlying
distribution is uniform, we require only $O(\sqrt{d})$ unlabeled samples. Thus, as a byproduct we improve over the prior best
result in the passive testing model as well. Note that Kearns and Ron [41] show that $\Omega(\sqrt{d})$ examples are required
to test unions of intervals over the uniform distribution in the passive testing model, so this result is tight. Moreover,
one can show that in the distribution-free testing model of Halevy and Kushilevitz [35] one cannot perform testing
of this class from $O(1)$ queries; thus, this class demonstrates a separation between these models (see Appendix A).

At the heart of the analysis of our algorithm is a characterization of functions that are unions of intervals in
terms of their noise sensitivity, shown via developing a local self-corrector for this class. The noise sensitivity
of boolean functions is a powerful tool that has led to recent advances in hardness of approximation [42, 52],
learning theory [43, 44, 24], and differential privacy [16]. (See also [53] for more details on the applications of
noise sensitivity to the study of boolean functions.) Our work presents a novel application of noise sensitivity in the
domain of property testing.

Linear threshold functions. The function $f : \mathbb{R}^n \to \{0,1\}$ is a linear threshold function if there are n + 1
parameters $w_1, \dots, w_n, \theta \in \mathbb{R}$ such that $f(x) = \mathrm{sgn}(w_1 x_1 + \cdots + w_n x_n - \theta)$ for every $x \in \mathbb{R}^n$. Linear threshold
functions are perhaps the most widely-used function class in machine learning. We show that both active and passive
testing of linear threshold functions in $\mathbb{R}^n$ can be done with $O(\sqrt{n})$ labeled examples over the Gaussian
distribution. This is substantially less than the $\Omega(n)$ labeled examples needed for learning (even over the Gaussian
distribution) and yields a new upper bound for the passive testing model as well. The key challenge here is that
estimating a statistic due to Matulef et al. [49] (which can be done with $O(1)$ queries if arbitrary queries are allowed
[49]) would require $\Omega(n)$ samples if done from independent pairs of random examples in the natural way; this is
no better than learning. We overcome this obstacle by re-using non-independent pairs of examples in the estimation,
together with an analysis and modification of the statistic that allow for use of a theorem of Arcones [3] on the
concentration of U-statistics. At a technical level, this result uses the fact that even though typical values of $(x \cdot y)^2$
may be quite large (i.e., $\Omega(n)$) when x and y have every coordinate selected from the standard normal, for any
boolean function $f$ it will be the case that for "most" values y, the quantity $(\mathbb{E}_x[f(x)\, x \cdot y])^2$ is quite small, which
can be shown via a Fourier decomposition of $f$. This in turn allows one to show strong concentration.

Interestingly, we show these bounds are nearly tight, giving lower bounds of $\tilde{\Omega}(n^{1/3})$ and $\tilde{\Omega}(\sqrt{n})$ on the number
of labeled examples needed for active and passive testing respectively. The proof of these lower bounds relies on our
notion of active and passive testing dimensions. More precisely, by using the notion of dimension, we reduce the
problem of proving the lower bounds to that of bounding the operator norm of random matrices. This task is then
completed by appealing to recent results on the non-asymptotic analysis of random matrices [64]. Our lower bound
demonstrates a separation between the active model and the standard (arbitrary-query) testing model.

Disjoint unions of testable properties. We also show that any disjoint union of testable properties remains testable 
in the active testing model, allowing one to build testable properties out of simpler components; this is then used 
to provide label-efficient testers for several properties used in semi-supervised learning including the cluster and 
margin assumptions. See Section 5 for details.

Testing dimension. One of the most powerful notions in learning theory is that of the dimension or intrinsic
complexity of a class of functions. Such notions of dimension (e.g., VC dimension [63], SQ dimension [12], Rademacher
complexity) have been exceedingly effective in determining the sample complexity for learning classes of
functions in various learning models. Y. Mansour and G. Kalai (personal communication, see also [40]) posed the
question of whether comparable notions of dimension might exist for testing. In this work, we answer in the
affirmative and introduce the first such notions of dimension for property testing, for both our new model of active testing
and the passive testing model.

We show that these notions of testing dimension characterize (up to constant factors) the intrinsic number of 
labeled examples required to test the given property with respect to a given distribution in the active and passive 
testing models, respectively. We also introduce a simpler "coarse" notion of testing dimension that characterizes the 
set of properties testable with $O(1)$ queries in the active testing model.

We use these testing dimensions to obtain lower bounds on the query complexity for testing a number of
different properties in both active and passive testing models. Notably, we show that $\Omega(\log n)$ queries are needed to
distinguish dictator functions from random functions in both models. This shows that active testing of dictators is as
hard as learning dictator functions, and also implies a lower bound of $\Omega(\log n)$ queries for testing a large number of
properties (including decision trees, functions of low Fourier degree, juntas, and DNFs) in the active testing model.¹

1.2 Related Work 

Active learning. Active learning has become a topic of substantial importance in machine learning due to the rise 
of applications in which unlabeled data can be sampled much more cheaply than data can be labeled, including text 
classification [50, 61], medical imaging [39], and image and music retrieval [60, 48] among many others [30, 65,
66]. This has led to significant work in algorithmic development including a yearly active-learning competition,
with monetary prizes.² There has also been substantial progress in the theoretical understanding of its underlying
principles, including both algorithmic guarantees and the design and analysis of appropriate sample complexity
measures for this setting [23, 11, 9, 36, 45, 10, 2]. Active learning, unlike passive learning,

¹Building on this analysis, Noga Alon (personal communication) has recently developed a stronger $\Omega(k \log n)$ lower bound for the active
testing dimension of juntas via use of the Kim-Vu polynomial method.
²See http://www.causality.inf.ethz.ch/activelearning.php.






has no known strong Structural Risk Minimization bounds, which further motivates our work. We note that while our 
model is motivated by active learning, our techniques are very different from those in the active learning literature. 

Other Testing Models. In addition to the standard model of property testing [57] and the passive model of property
testing [32, 41] discussed above, other models have been introduced to address different testing scenarios. The
tolerant testing model, introduced by Parnas, Ron, and Rubinfeld [54], models situations where
the tester must not only accept functions that have a given property but also must accept functions that are close to
having the property. The distribution-free testing model was introduced by Halevy and Kushilevitz [35] (see also
[33, 34, 26]) to explore the setting where the tester does not know the underlying distribution D. Both of these
models allow arbitrary queries, however, and so do not address the machine learning settings motivating this work in
which one can only query inputs from a large sample of unlabeled points. In Appendix A we discuss the technical
relations between active testing and these other models.

2 The Active Property Testing Model 

A property $\mathcal{P}$ of boolean functions is simply a subset of all boolean functions. We will also refer to properties as
classes of functions. The distance of a function $f$ to the property $\mathcal{P}$ with respect to a distribution D over the domain
of the function is $\mathrm{dist}_D(f, \mathcal{P}) := \min_{g \in \mathcal{P}} \Pr_{x \sim D}[f(x) \neq g(x)]$. A tester for $\mathcal{P}$ is a randomized algorithm that
must distinguish (with high probability) between functions in $\mathcal{P}$ and functions that are far from $\mathcal{P}$. In the standard
property testing model introduced by Rubinfeld and Sudan [57], a tester is allowed to query the value of the function
on any input in order to make this decision. We consider instead a model in which we add restrictions to the possible
queries:

Definition 2.1 (Property tester). An s-sample, q-query $\epsilon$-tester for $\mathcal{P}$ over the distribution D is a randomized
algorithm A that draws a sample S of size s from D, queries for the value of $f$ on q points of S, and then

1. Accepts w.p. at least $\frac{2}{3}$ when $f \in \mathcal{P}$, and

2. Rejects w.p. at least $\frac{2}{3}$ when $\mathrm{dist}_D(f, \mathcal{P}) > \epsilon$.

We will use the terms "label request" and "query" interchangeably. Definition 2.1 coincides with the standard
definition of property testing when the number of samples is unlimited and the distribution's support covers the entire
domain. In the other extreme case where we fix q = s, our definition then corresponds to the passive testing model
of Goldreich, Goldwasser, and Ron [32], where the inputs queried by the tester are sampled from the distribution.
Finally, by setting s to be polynomial in an appropriate measure of the input domain or property $\mathcal{P}$, we obtain the
active testing model that is the focus of this paper:

Definition 2.2 (Active tester). A randomized algorithm is a q-query active $\epsilon$-tester for $\mathcal{P} \subseteq \{0,1\}^n \to \{0,1\}$ over
D if it is a poly(n)-sample, q-query $\epsilon$-tester for $\mathcal{P}$ over D.³
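
As an illustration of the access model in Definitions 2.1 and 2.2, the following is a minimal Python sketch of an oracle that answers label requests only on points of a pre-drawn unlabeled sample and enforces a budget of q requests. The names `ActiveOracle`, `draw_sample`, and `LabelBudgetExceeded` are hypothetical and are not taken from the paper; the sketch only shows the interface a tester would be restricted to.

```python
import random


class LabelBudgetExceeded(Exception):
    """Raised when a tester asks for more than q labels (hypothetical name)."""


class ActiveOracle:
    """Answers label requests only on points of a fixed unlabeled sample S."""

    def __init__(self, f, sample, q):
        self._f = f                  # the unknown function being tested
        self._sample = list(sample)  # the s unlabeled points drawn from D
        self._budget = q             # maximum number of label requests
        self.queries_used = 0

    @property
    def sample(self):
        # The tester may inspect all unlabeled points for free.
        return list(self._sample)

    def label(self, index):
        # A label can only be requested by index into the sample, so the
        # tester cannot query fictitious points outside S.
        point = self._sample[index]
        if self.queries_used >= self._budget:
            raise LabelBudgetExceeded("label budget q exhausted")
        self.queries_used += 1
        return self._f(point)


def draw_sample(draw_from_D, s, rng):
    """Draw s unlabeled examples; draw_from_D(rng) is an assumed sampler for D."""
    return [draw_from_D(rng) for _ in range(s)]
```

Setting q equal to the sample size and labeling every point recovers the passive setting described above.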

In some cases, the domain of our functions is not $\{0,1\}^n$. In those cases, we require s to be polynomial in
some other appropriate measure of complexity of the domain or property $\mathcal{P}$ that we specify explicitly. Note that
in Definition 2.1, since we do not have direct membership query access (at arbitrary points), our tester must accept
w.p. at least $\frac{2}{3}$ when $f$ is such that $\mathrm{dist}_D(f, \mathcal{P}) = 0$, even if $f$ does not satisfy $\mathcal{P}$ over the entire input space. See
Appendix A for a comparison of active testing to other testing models.

3 Testing Unions of Intervals 

The function $f : [0,1] \to \{0,1\}$ is a union of d intervals if there are at most d non-overlapping intervals $[\ell_1, u_1], \dots, [\ell_d, u_d]$
such that $f(x) = 1$ iff $\ell_i \le x \le u_i$ for some $i \in [d]$. The VC dimension of this class is 2d, so learning a union

³We emphasize that the name active tester is chosen to reflect the connection with active learning. It is not meant to imply that this model
of testing is somehow "more active" than the standard property testing model.
of d intervals requires $\Omega(d)$ queries. By contrast, we show that active testing of unions of d intervals can be done
with a number of label requests that is independent of d, for any (even unknown) distribution D. Specifically, we
prove that we can test unions of d intervals in the active testing model using only $O(1/\epsilon^4)$ label requests from a set
of $\mathrm{poly}(d, 1/\epsilon)$ unlabeled examples. Furthermore, over the uniform distribution, we need a total of only $O(\sqrt{d}/\epsilon^5)$
unlabeled examples. Note that previously it was not known how to test this class from $O(1)$ queries even in the
(standard) membership query model even over the uniform distribution.⁴

Theorem 3.1. For any (known or unknown) distribution D, testing unions of d intervals in the active testing model
can be done using only $O(1/\epsilon^4)$ queries. In the case of the uniform distribution, we further need only $O(\sqrt{d}/\epsilon^5)$
unlabeled examples.

We prove Theorem 3.1 by beginning with the case that the underlying distribution is uniform over [0, 1], and
afterwards show how to generalize to arbitrary distributions. Our tester is based on showing that unions of intervals 
have a noise sensitivity characterization. 

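Definition 3.2. Fix $\delta > 0$. The local $\delta$-noise sensitivity of the function $f : [0,1] \to \{0,1\}$ at $x \in [0,1]$ is
$\mathrm{NS}_\delta(f, x) = \Pr_{y \sim_\delta x}[f(x) \neq f(y)]$, where $y \sim_\delta x$ represents a draw of y uniform in $(x - \delta, x + \delta) \cap [0, 1]$. The
noise sensitivity of $f$ is

$$\mathrm{NS}_\delta(f) = \Pr_{x,\, y \sim_\delta x}[f(x) \neq f(y)]$$

or, equivalently, $\mathrm{NS}_\delta(f) = \mathbb{E}_x\, \mathrm{NS}_\delta(f, x)$.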
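
The definition can be made concrete with a small Monte Carlo estimate. The following Python sketch assumes the uniform distribution on [0, 1] and direct access to f; the function names `draw_neighbor` and `estimate_noise_sensitivity` are illustrative and not from the paper.

```python
import random


def draw_neighbor(x, delta, rng):
    """Draw y uniformly from (x - delta, x + delta) intersected with [0, 1]."""
    lo, hi = max(0.0, x - delta), min(1.0, x + delta)
    return rng.uniform(lo, hi)


def estimate_noise_sensitivity(f, delta, trials, rng):
    """Empirical fraction of pairs (x, y) with y ~_delta x and f(x) != f(y)."""
    disagreements = 0
    for _ in range(trials):
        x = rng.random()                 # x uniform in [0, 1]
        y = draw_neighbor(x, delta, rng)
        if f(x) != f(y):
            disagreements += 1
    return disagreements / trials
```

For a single interval well inside [0, 1], the estimate should be close to delta, since each of the two boundaries contributes about delta/2.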

A simple argument shows that unions of d intervals have (relatively) low noise sensitivity: 

Proposition 3.3. Fix $\delta > 0$ and let $f : [0,1] \to \{0,1\}$ be a union of d intervals. Then $\mathrm{NS}_\delta(f) \le d\delta$.

Proof sketch. Draw $x \in [0,1]$ uniformly at random and $y \sim_\delta x$. The inequality $f(x) \neq f(y)$ can only hold when a
boundary $b \in [0,1]$ of one of the d intervals in $f$ lies in between x and y. For any point $b \in [0,1]$, the probability
that $x < b < y$ or $y < b < x$ is at most $\frac{\delta}{2}$, and there are at most 2d boundaries of intervals in $f$, so the proposition
follows from the union bound. □
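
The per-boundary probability in the proof sketch can be checked by a direct calculation. The following worked computation assumes that the boundary b lies at least $2\delta$ away from both endpoints of [0, 1], so that no neighborhood is truncated; it is a sanity check under that assumption, not the paper's full argument.

```latex
\Pr[x < b < y]
  = \int_{b-\delta}^{b} \frac{(x+\delta) - b}{2\delta}\, dx
  = \frac{\delta}{4},
\qquad
\Pr[x < b < y \ \text{or}\ y < b < x]
  = \frac{\delta}{4} + \frac{\delta}{4}
  = \frac{\delta}{2},
\qquad
\mathrm{NS}_\delta(f) \le 2d \cdot \frac{\delta}{2} = d\delta .
```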

The key to the tester is showing that the converse of the above statement is approximately true as well: for $\delta$
small enough, every function that has noise sensitivity not much larger than $d\delta$ is close to being a union of d intervals.
(Full proof in Appendix C.)

Lemma 3.4. Fix $\delta = \frac{\epsilon^2}{32d}$. Let $f : [0,1] \to \{0,1\}$ be a function with noise sensitivity bounded by $\mathrm{NS}_\delta(f) \le d\delta(1 + \frac{\epsilon}{4})$.
Then $f$ is $\epsilon$-close to a union of d intervals.

Proof outline. The proof proceeds in two steps. First, we show that so long as $f$ has low noise-sensitivity, it can
be "locally self-corrected" to a function $g : [0,1] \to \{0,1\}$ that is $\frac{\epsilon}{2}$-close to $f$ and is a union of at most $d(1 + \frac{\epsilon}{4})$
intervals. We then show that $g$ (and every other function that is a union of at most $d(1 + \frac{\epsilon}{4})$ intervals) is $\frac{\epsilon}{2}$-close to
a union of d intervals.

To construct the function $g$, we consider a smoothed function $f_\delta : [0,1] \to [0,1]$ obtained by taking the
convolution of $f$ and a uniform kernel of width $2\delta$. We define $\tau$ to be some appropriately small parameter. When $f_\delta(x) \le \tau$,
then this means that nearly all the points in the $\delta$-neighborhood of x have the value 0 in $f$, so we set $g(x) = 0$.
Similarly, when $f_\delta(x) \ge 1 - \tau$, then we set $g(x) = 1$. (This procedure removes any "local noise" that might be
present in $f$.) This leaves all the points x where $\tau < f_\delta(x) < 1 - \tau$. Let us call these points undefined. For each
such point x we take the largest value $y \le x$ that is defined and set $g(x) = g(y)$. The key technical part of the proof
involves showing that the construction described above yields a function $g$ that is $\frac{\epsilon}{2}$-close to $f$ and that is a union of
$d(1 + \frac{\epsilon}{4})$ intervals. Due to space constraints, we defer the argument to Appendix C. □
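
The construction of g can be sketched on a discretized domain. The following Python sketch represents f by its 0/1 values on an evenly spaced grid, uses a window of k grid cells on each side in place of the $\delta$-neighborhood, and takes the threshold `tau` as an input; the grid representation, the default value 0 before the first defined point, and all function names are assumptions made for illustration, not the paper's procedure.

```python
def smooth(values, k):
    """Average of 0/1 values over a window of k cells on each side (clipped)."""
    n = len(values)
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def self_correct(values, k, tau):
    """Round near-constant neighborhoods; fill undefined points from the left."""
    corrected = []
    last_defined = 0  # assumption: default value before any defined point
    for s in smooth(values, k):
        if s <= tau:
            last_defined = 0      # almost all neighbors are 0
        elif s >= 1 - tau:
            last_defined = 1      # almost all neighbors are 1
        # otherwise the point is undefined and copies the nearest defined
        # value to its left
        corrected.append(last_defined)
    return corrected


def count_intervals(values):
    """Number of maximal runs of 1s."""
    return sum(
        1 for i, v in enumerate(values)
        if v == 1 and (i == 0 or values[i - 1] == 0)
    )
```

On an input consisting of one long run of 1s plus an isolated flipped point, the corrected function removes the isolated point and shifts the run's endpoints by a few cells, so it stays close to the input while using fewer intervals.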

⁴The best prior result achieved a relaxed guarantee of distinguishing the case that $f$ is a union of d intervals from the case that $f$ is $\epsilon$-far
from a union of $d/\epsilon$ intervals [41].
The noise sensitivity characterization of unions of intervals obtained by Proposition 3.3 and Lemma 3.4 suggests
a natural approach for building a tester: design an algorithm that estimates the noise sensitivity of the input function
and accepts iff this noise sensitivity is small enough. This is indeed what we do:

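Union of Intervals Tester($f, d, \epsilon$)
Parameters: $\delta = \frac{\epsilon^2}{32d}$, $r = O(\epsilon^{-4})$.

1. For rounds $i = 1, \dots, r$,

1.1 Draw $x \in [0,1]$ uniformly at random.

1.2 Draw samples until we obtain $y \in (x - \delta, x + \delta)$.

1.3 Set $Z_i = \mathbf{1}[f(x) \neq f(y)]$.

2. Accept iff $\frac{1}{r} \sum Z_i \le d\delta(1 + \frac{\epsilon}{8})$.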

The algorithm makes $2r = O(\epsilon^{-4})$ queries to the function. Since a draw in Step 1.2 is in the desired range with
probability $2\delta$, the number of samples drawn by the algorithm is a random variable with very tight concentration
around $r(1 + \frac{1}{2\delta}) = O(d/\epsilon^6)$. The draw in Step 1.2 also corresponds to choosing $y \sim_\delta x$. As a result, the probability
that $f(x) \neq f(y)$ in a given round is exactly $\mathrm{NS}_\delta(f)$, and the average $\frac{1}{r} \sum Z_i$ is an unbiased estimate of the noise
sensitivity of $f$. By Proposition 3.3, Lemma 3.4, and Chernoff bounds, the algorithm therefore errs with probability
less than $\frac{1}{3}$ provided that $r > c \cdot 1/(d\delta\epsilon^2) = c \cdot 32/\epsilon^4$ for some suitably large constant c.
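
The tester can be sketched in a few lines of Python. The concrete parameter choices used here ($\delta = \epsilon^2/(32d)$, $r$ proportional to $32/\epsilon^4$, and acceptance threshold $d\delta(1 + \epsilon/8)$) are assumptions of this sketch, as are the constant `c` and the sampler argument `draw`; for a reliable tester `c` would have to be a suitably large constant, which the sketch does not attempt to pin down.

```python
import random


def union_of_intervals_tester(f, d, eps, draw, rng, c=1.0):
    """Accept iff the empirical noise sensitivity of f is at most the threshold.

    draw(rng) is an assumed sampler returning one unlabeled point in [0, 1].
    """
    delta = eps ** 2 / (32 * d)        # assumed noise radius
    r = int(c * 32 / eps ** 4) + 1     # assumed number of rounds
    disagreements = 0
    for _ in range(r):
        x = draw(rng)                  # Step 1.1
        y = draw(rng)                  # Step 1.2: redraw until y is close to x
        while abs(x - y) >= delta:
            y = draw(rng)
        disagreements += int(f(x) != f(y))   # Step 1.3 (two label requests)
    return disagreements / r <= d * delta * (1 + eps / 8)   # Step 2
```

A constant function is always accepted, because it never produces a disagreement, while a function that alternates on a scale much finer than delta is rejected with overwhelming probability.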

Improved unlabeled sample complexity: Notice that by changing Steps 1.1-1.2 slightly to pick the first pair (x, y)
such that $|x - y| < \delta$, we immediately improve the unlabeled sample complexity to $O(\sqrt{d}/\epsilon^5)$ without affecting the
analysis. In particular, this procedure is equivalent to picking $x \in [0,1]$ then $y \sim_\delta x$.⁵ As a result, up to $\mathrm{poly}(1/\epsilon)$
terms, we also improve over the passive testing bounds of Kearns and Ron [41], which are able only to distinguish
the case that $f$ is a union of d intervals from the case that $f$ is $\epsilon$-far from being a union of $d/\epsilon$ intervals. (Their results
use $O(\sqrt{d}/\epsilon^{1.5})$ examples.) Kearns and Ron [41] show that $\Omega(\sqrt{d})$ examples are necessary for passive testing, so in
terms of d this is optimal.
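
The modified pair selection can be sketched as follows. The Python function below scans an unlabeled sample in arrival order and returns the first pair of points within distance delta; the name `first_close_pair` and the quadratic scan are choices made for this illustration only.

```python
def first_close_pair(sample, delta):
    """Indices (i, j), i < j, of the first pair with |x_i - x_j| < delta.

    Points are scanned in arrival order; returns None if no pair is close.
    """
    for j in range(len(sample)):
        for i in range(j):
            if abs(sample[i] - sample[j]) < delta:
                return i, j
    return None
```

Only the two points of the returned pair need to be labeled, which is why the label cost per round is unchanged while far fewer unlabeled draws are needed.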

Active testing over arbitrary distributions: We now consider the case that examples are drawn from some arbitrary
distribution D. First, let us consider the easier case that D is known. In that case, we can reduce the problem of
testing over general distributions to that of testing over the uniform distribution on [0, 1] by using the CDF of D. In
particular, given point x, define $p_x = \Pr_{y \sim D}[y \le x]$. So, for x drawn from D, $p_x$ is uniform in [0, 1].⁶ As a result
we can just replace Step 1.2 in the tester with sampling until we obtain y such that $p_y \in (p_x - \delta, p_x + \delta)$. Now,
suppose D is not known. In that case, we do not know the $p_x$ and $p_y$ values exactly. However, we can use the fact
that the VC-dimension of the class of initial intervals on the line equals 1 to uniformly estimate all such values from
a polynomial-sized unlabeled sample. In particular, $O(1/\gamma^2)$ unlabeled examples are sufficient so that with high
probability, every point x has the property that the estimate $\hat{p}_x$ of $p_x$ computed with respect to the sample (the fraction
of points in the sample that are $\le x$) will be within $\gamma$ of the correct $p_x$ value [13]. If we define $\widehat{\mathrm{NS}}_\delta(f)$ to be the
noise-sensitivity of $f$ computed using these estimates, then we get $\frac{\delta-\gamma}{\delta}\mathrm{NS}_{\delta-\gamma}(f) \le \widehat{\mathrm{NS}}_\delta(f) \le \frac{\delta+\gamma}{\delta}\mathrm{NS}_{\delta+\gamma}(f)$. This
implies that $\gamma = O(\epsilon\delta)$ is sufficient so that the noise-sensitivity estimates are sufficiently accurate for the procedure
to work as before.
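
The empirical CDF estimate used in the unknown-distribution case can be sketched as follows. The Python code below builds the estimate from an unlabeled sample as the fraction of sample points at most x, and uses it to decide whether two points count as close; the names `make_cdf_estimate` and `close_in_cdf` are hypothetical.

```python
import bisect


def make_cdf_estimate(unlabeled_sample):
    """Return p_hat, where p_hat(x) is the fraction of sample points <= x."""
    ordered = sorted(unlabeled_sample)
    m = len(ordered)

    def p_hat(x):
        return bisect.bisect_right(ordered, x) / m

    return p_hat


def close_in_cdf(p_hat, x, y, delta):
    """Stand-in for Step 1.2: is the estimated CDF gap below delta?"""
    return abs(p_hat(x) - p_hat(y)) < delta
```

Because closeness is measured in CDF space, the same tester logic applies regardless of how D spreads its mass over the line.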

Putting these results together, we have Theorem 3.1.

⁵Except for events of $O(\delta)$ probability mass at the boundary.

⁶We are assuming here that D is continuous and has a pdf. If D has point masses, then instead define $p_x^L = \Pr_y[y < x]$ and $p_x^U =
\Pr_y[y \le x]$ and select $p_x$ uniformly in $[p_x^L, p_x^U]$.
4 Testing Linear Threshold Functions



A boolean function $f : \mathbb{R}^n \to \{0,1\}$ is a linear threshold function (LTF) if there exist n + 1 real-valued parameters
$w_1, \dots, w_n, \theta$ such that for each $x \in \mathbb{R}^n$, we have $f(x) = \mathrm{sgn}(w_1 x_1 + \cdots + w_n x_n - \theta)$.⁷ The main result of this
section is that it is possible to efficiently test whether a function is a linear threshold function in the active and passive
testing models with substantially fewer labeled examples than needed for learning, along with near-matching lower
bounds.

Theorem 4.1. We can efficiently test linear threshold functions under the Gaussian distribution with $O(\sqrt{n} \log n)$
labeled examples in both active and passive testing models. Furthermore, no (even computationally inefficient)
algorithm can test with $\tilde{o}(n^{1/3})$ labeled examples for active testing or $\tilde{o}(\sqrt{n})$ labeled examples for passive testing.

Note that the class of linear threshold functions requires $\Omega(n)$ labeled examples for learning, even over the
Gaussian distribution. Linear threshold functions can be tested with a constant number of queries in the standard
(arbitrary query) property testing model [49].

The starting point for the upper bound in Theorem 4.1 is a characterization lemma of linear threshold functions
in terms of the following self-correlation statistic. To be precise, we are scaling so that each coordinate is drawn
independently from $\mathcal{N}(0, 1)$, so a typical example will have length $\Theta(\sqrt{n})$.

Definition 4.2. The self-correlation coefficient of the function $f : \mathbb{R}^n \to \mathbb{R}$ is $\rho(f) := \mathbb{E}_{x,y}[f(x) f(y) \langle x, y \rangle]$.

Lemma 4.3 (Matulef et al. [49]). There is an explicit continuous function $W : \mathbb{R} \to \mathbb{R}$ with bounded derivative
$\|W'\|_\infty \le 1$ and peak value $W(0) = \frac{2}{\pi}$ such that every linear threshold function $f : \mathbb{R}^n \to \{-1, 1\}$ satisfies
$\rho(f) = W(\mathbb{E}_x f)$. Moreover, every function $g : \mathbb{R}^n \to \{-1, 1\}$ that satisfies $|\rho(g) - W(\mathbb{E}_x g)| \le 4\epsilon^3$ is $\epsilon$-close to
being a linear threshold function.

The proof of Lemma 4.3 relies on the Hermite decomposition of functions. In fact, the original characterization
of Matulef et al. [49] is stated in terms of the level-1 Hermite weight of functions. The above characterization follows
easily from their result. For completeness, we include the details in Appendix D.

Lemma 4.3 suggests an obvious approach to testing for linear threshold functions from random examples: simply
estimate the self-correlation coefficient of Definition 4.2 by repeatedly drawing pairs of labeled examples $(x_i, y_i)$
from the Gaussian distribution in $\mathbb{R}^n$ and computing the empirical average of the quantities $f(x_i) f(y_i) \langle x_i, y_i \rangle$
observed. The problem with this approach, however, is that the dot-product $\langle x_i, y_i \rangle$ will typically have magnitude
$\Theta(\sqrt{n})$ (one can view it as essentially the result of an $n$ step random walk). Therefore to estimate the self-correlation
coefficient to accuracy $O(1)$ via independent random samples in this way would require $\Omega(n)$ labeled examples. This
is of course not very useful, since it is the same as the number of labeled examples needed to learn an LTF.
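
The $\Omega(n)$ figure for this naive estimator can be checked with a short second-moment calculation. The following is a sketch under the assumption (as in Lemma 4.3) that $f$ takes values in $\{-1, 1\}$; the symbol $m$ for the number of independent pairs is introduced only for this illustration. Since the coordinates of $x$ and $y$ are independent standard Gaussians, the cross terms vanish and

$$\mathbb{E}_{x,y}\big[(f(x) f(y) \langle x, y \rangle)^2\big] = \mathbb{E}_{x,y}\big[\langle x, y \rangle^2\big] = \sum_{i=1}^{n} \mathbb{E}[x_i^2]\, \mathbb{E}[y_i^2] = n.$$

Each term therefore has variance $n - \rho(f)^2$, and the average of $m$ independent pairs has standard deviation $\sqrt{(n - \rho(f)^2)/m}$. Bringing this down to a constant requires $m$ to grow linearly in $n$.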

We will be able to achieve an improved bound, however, using the following idea: rather than averaging
over independent pairs $(x, y)$, we will draw a smaller sample and average over all (non-independent) pairs within
the sample. That is, we request $q$ random labeled examples $x_1, \ldots, x_q$, and now estimate $\rho(f)$ by computing

$$\binom{q}{2}^{-1} \sum_{i<j} f(x_i) f(x_j) \langle x_i, x_j \rangle.$$

Of course, the terms in the summation are no longer independent. However, they
satisfy the property that even though the quantity $f(x) f(y) \langle x, y \rangle$ is typically large, for most values $y$, the quantity
$\mathbb{E}_x[f(x) f(y) \langle x, y \rangle]$ is small. (This can be shown via a Fourier decomposition of the function $f$.) This, together with
additional truncation of the quantity in question, will allow us to apply a Bernstein-type inequality for U-statistics
due to Arcones [3] in order to achieve the desired concentration.
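
The pairwise-averaging idea can be sketched in a few lines of code. This is a minimal illustration only, not the LTF TESTER of Figure 1: the names `gaussian_sample` and `self_correlation_ustat`, the optional truncation argument `tau`, and the sample sizes are all assumptions made for the example, and labels are taken in $\{-1, 1\}$ as in Lemma 4.3.

```python
import random


def gaussian_sample(q, n, seed=0):
    """Draw q points from the standard n-dimensional Gaussian (illustrative helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(q)]


def self_correlation_ustat(f, points, tau=None):
    """Order-2 U-statistic estimate of rho(f) = E[f(x) f(y) <x, y>].

    Averages f(x_i) f(x_j) <x_i, x_j> over all pairs i < j of the sample.
    If tau is given, pairs with |<x_i, x_j>| > tau contribute 0 (truncation).
    """
    labels = [f(x) for x in points]  # one label request per sample point
    q = len(points)
    total = 0.0
    for i in range(q):
        for j in range(i + 1, q):
            dot = sum(a * b for a, b in zip(points[i], points[j]))
            if tau is None or abs(dot) <= tau:
                total += labels[i] * labels[j] * dot
    return total / (q * (q - 1) / 2)


if __name__ == "__main__":
    # Halfspace through the origin with labels in {-1, 1}; for this f, rho(f) = 2/pi.
    halfspace = lambda x: 1.0 if x[0] > 0 else -1.0
    sample = gaussian_sample(q=400, n=10)
    print(self_correlation_ustat(halfspace, sample))
```

Note that the $q$ label requests are reused across all $\binom{q}{2}$ pairs, which is where the savings over independent pairs come from.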

The resulting LTF TESTER is given in Figure 1. This algorithm has two advantages. First, it is a valid tester
in both the active and passive property testing models since the $q$ inputs queried by the algorithm are all drawn
independently at random from the standard $n$-dimensional Gaussian distribution. Second, the algorithm itself is
very simple. As in many cases with property testing, however, the analysis of this algorithm is more challenging.

† Here, $\mathrm{sgn}(z) = \mathbf{1}[z > 0]$ is the standard sign function.
LTF Tester($f$, $\epsilon$)

Parameters: $\tau = \sqrt{4n \log(4n/\epsilon^3)}$, $m = 800\tau/\epsilon^3 + 32/\epsilon^6$.

5. Accept iff $|\tilde{\rho} - W(\tilde{\mu})| \le 2\epsilon^3$.

Figure 1: LTF TESTER



Given Lemma 4.3, as noted above, the key challenge in the proof of correctness of the LTF TESTER is controlling
the error of the estimate $\tilde{\rho}$ of $\rho(f)$ in Step 4, which we do with concentration of measure results for U-statistics. The
U-statistic (of order 2) with symmetric kernel function $g : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is

$$U_g(x^1, \ldots, x^q) := \binom{q}{2}^{-1} \sum_{1 \le i < j \le q} g(x^i, x^j).$$

U-statistics are unbiased estimators of the expectation of their kernel function and, even more importantly, when
the kernel function is "well-behaved", the tails of their distributions satisfy strong concentration. In our case, the
thresholded kernel function

$$g(x, y) = \begin{cases} f(x) f(y) \langle x, y \rangle & \text{if } |\langle x, y \rangle| \le \tau \\ 0 & \text{otherwise} \end{cases}$$

with threshold parameter $\tau$ allows us to apply Arcones' theorem.

Lemma 4.4 (Arcones [3]). For a symmetric function $h : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, let $\Sigma^2 = \mathbb{E}_x[\mathbb{E}_y[h(x, y)]^2] - \mathbb{E}_{x,y}[h(x, y)]^2$,
let $b = \|h - \mathbb{E} h\|_\infty$, and let $U_q(h)$ be a random variable obtained by drawing $x^1, \ldots, x^q$ independently at random
and setting $U_q(h) = \binom{q}{2}^{-1} \sum_{i<j} h(x^i, x^j)$. Then for every $t > 0$,

An argument combining Lemma 4.3 with a separate argument showing that $g$ is "close" to an unbiased estimator
for $\rho(f)$ provides the desired guarantee for the LTF TESTER. The complete proof is presented in Appendix D.

It is natural to ask whether we can further improve the query complexity of the tester for linear threshold functions
by using U-statistics of higher order. The lower bound in Theorem 4.1 shows that this (or any other possible
active or passive testing approach) cannot yield a query complexity sub-polynomial in $n$. We defer the discussion
of this lower bound to Section 6, where we will use the notion of testing dimension to establish the bound.

5 Testing Disjoint Unions of Testable Properties

We now show that active testing has the feature that a disjoint union of testable properties is testable, with a number
of queries that is independent of the size of the union; this feature does not hold for passive testing. In addition
to providing insight into the distinction between the two models, this fact will be useful in our analysis of
semi-supervised learning-based properties mentioned below and discussed more fully in Appendix G.

Specifically, given properties $\mathcal{P}_1, \ldots, \mathcal{P}_N$ over domains $X_1, \ldots, X_N$, define their disjoint union $\mathcal{P}$ over domain
$X = \{(i, x) : i \in [N], x \in X_i\}$ to be the set of functions $f$ such that $f(i, x) = f_i(x)$ for some $f_i \in \mathcal{P}_i$. In addition,
for any distribution $D$ over $X$, define $D_i$ to be the conditional distribution over $X_i$ when the first component is $i$. If
each $\mathcal{P}_i$ is testable over $D_i$ then $\mathcal{P}$ is testable over $D$ with only small overhead in the number of queries:
Theorem 5.1. Given properties $\mathcal{P}_1, \ldots, \mathcal{P}_N$, if each $\mathcal{P}_i$ is testable over $D_i$ with $q(\epsilon)$ queries and $U(\epsilon)$ unlabeled
samples, then their disjoint union $\mathcal{P}$ is testable over the combined distribution $D$ with $O(q(\epsilon/2) \cdot \log^3 \frac{1}{\epsilon})$ queries
and $O(U(\epsilon/2) \cdot \frac{N}{\epsilon} \log^3 \frac{1}{\epsilon})$ unlabeled samples.

Proof. See Appendix E. □

As a simple example, consider $\mathcal{P}_i$ to contain just the constant functions 1 and 0. In this case, $\mathcal{P}$ is equivalent
to what is often called the "cluster assumption," used in semi-supervised and active learning [15, 20]: if data
lies in some number of clearly identifiable clusters, then all points in the same cluster should have the same label.
Here, each $\mathcal{P}_i$ individually is easily testable (even passively) with $O(1/\epsilon)$ labeled samples, so Theorem 5.1 implies
the cluster assumption is testable with $\mathrm{poly}(1/\epsilon)$ queries.† However, it is not hard to see that passive testing with
$\mathrm{poly}(1/\epsilon)$ samples is not possible and in fact requires $\Omega(\sqrt{N}/\epsilon)$ labeled examples.‡
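
To illustrate why active queries help here, the following is a hedged sketch of one simple way to test the cluster assumption with $O(1/\epsilon)$ label requests. It is not the algorithm behind Theorem 5.1 (whose proof is in the appendix); the function name, the `(cluster_id, x)` representation of sample points, and the constant 2 in the number of rounds are assumptions made for the example. The idea is that if $f$ is $\epsilon$-far from the cluster assumption, then two random points from a cluster chosen according to $D$ disagree with probability roughly at least $\epsilon$ (for a large enough unlabeled sample), while a valid cluster labeling never produces a disagreement.

```python
import random
from collections import defaultdict


def cluster_assumption_tester(sample, query_label, eps, seed=0):
    """Active tester sketch for the cluster assumption.

    sample      : list of unlabeled points (cluster_id, x) drawn from D
    query_label : callable (cluster_id, x) -> label; each call is one label request
    eps         : distance parameter
    Returns True (accept) or False (reject).
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for cid, x in sample:
        by_cluster[cid].append(x)

    trials = int(2 / eps) + 1  # O(1/eps) rounds, two label requests per round
    for _ in range(trials):
        cid, x = rng.choice(sample)      # cluster chosen according to D
        y = rng.choice(by_cluster[cid])  # second point from the same cluster
        if query_label(cid, x) != query_label(cid, y):
            return False  # two labels in one cluster disagree
    return True
```

A passive tester cannot choose the second point from the same cluster, which is the source of the $\sqrt{N}$ dependence in the passive lower bound.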

We build on this to produce testers for other properties often used in semi-supervised learning. In particular, one
common assumption used (often called the margin or low-density assumption) is that there should be some large
margin $\gamma$ of separation between the positive and negative regions (but without assuming the target is necessarily a
linear threshold function). Here, we give a tester for this property, which uses a tester for the cluster property as a
subroutine, along with analysis of an appropriate weighted graph defined over the data. Specifically, we prove the
following result (see Appendix G for definitions and analysis).

Theorem 5.2. For any $\gamma$, $\gamma' = \gamma(1 - 1/c)$ for constant $c > 1$, for data in the unit ball in $\mathbb{R}^d$ for constant $d$, we can
distinguish the case that $D_f$ has margin $\gamma$ from the case that $D_f$ is $\epsilon$-far from margin $\gamma'$ using Active Testing with
$O(1/(\gamma^{2d} \epsilon^2))$ unlabeled examples and $O(1/\epsilon)$ label requests.



6 General Testing Dimensions 

The previous sections have discussed upper and lower bounds for a variety of classes. Here, we define notions of
testing dimension for passive and active testing that characterize (up to constant factors) the number of labels needed
for testing to succeed, in the corresponding testing protocols. These will be distribution-specific notions (like SQ
dimension [12] or Rademacher complexity IS in learning), so let us fix some distribution $D$ over the instance space
$X$, and furthermore fix some value $\epsilon$ defining our goal. That is, our goal is to distinguish the case that $\mathrm{dist}_D(f, \mathcal{P}) = 0$
from the case $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$.

For a given set $S$ of unlabeled points, and a distribution $\pi$ over boolean functions, define $\pi_S$ to be the distribution
over labelings of $S$ induced by $\pi$. That is, for $y \in \{0, 1\}^{|S|}$ let $\pi_S(y) = \Pr_{f \sim \pi}[f(S) = y]$. We now use this to
define a distance between distributions. Specifically, given a set of unlabeled points $S$ and two distributions $\pi$ and
$\pi'$ over boolean functions, define

$$d_S(\pi, \pi') = \frac{1}{2} \sum_{y \in \{0,1\}^{|S|}} |\pi_S(y) - \pi'_S(y)|$$

to be the variation distance between $\pi$ and $\pi'$ induced by $S$. Finally, let $\Pi_0$ be the set of all distributions $\pi$ over
functions in $\mathcal{P}$, and let $\Pi_\epsilon$ be the set of all distributions $\pi'$ in which a $1 - o(1)$ probability mass is over functions
at least $\epsilon$-far from $\mathcal{P}$. We are now ready to formulate our notions of dimension.
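
For small finite examples, the induced distance $d_S$ can be computed directly. The sketch below assumes that each distribution is uniform over an explicit list of functions; the helper names `induced_distribution` and `variation_distance` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def induced_distribution(functions, S):
    """pi_S for pi uniform over `functions`: the law of the labeling (f(x) for x in S)."""
    counts = Counter(tuple(f(x) for x in S) for f in functions)
    return {y: c / len(functions) for y, c in counts.items()}


def variation_distance(pi_S, pi_prime_S):
    """d_S(pi, pi') = (1/2) * sum over labelings y of |pi_S(y) - pi'_S(y)|."""
    support = set(pi_S) | set(pi_prime_S)
    return 0.5 * sum(abs(pi_S.get(y, 0.0) - pi_prime_S.get(y, 0.0)) for y in support)


if __name__ == "__main__":
    S = [(0, 0, 1), (0, 1, 1)]                        # two points of {0,1}^3
    dictators = [lambda x, i=i: x[i] for i in range(3)]
    constants = [lambda x: 0, lambda x: 1]
    print(variation_distance(induced_distribution(dictators, S),
                             induced_distribution(constants, S)))
```

When $d_S$ is small, no algorithm that sees only the labels of $S$ can reliably tell a function drawn from one distribution from a function drawn from the other.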

† Since the $\mathcal{P}_i$ are so simple in this case, one can actually test with only $O(1/\epsilon)$ queries.

‡ Specifically, suppose region 1 has $1 - 2\epsilon$ probability mass with $f_1 \in \mathcal{P}_1$, and suppose the other regions equally share the remaining $2\epsilon$
probability mass and either (a) are each pure but random (so $f \in \mathcal{P}$) or (b) are each 50/50 (so $f$ is $\epsilon$-far from $\mathcal{P}$). Distinguishing these cases
requires seeing at least two points with the same index $i \ne 1$, yielding the $\Omega(\sqrt{N}/\epsilon)$ bound.
6.1 Passive Testing Dimension

Definition 6.1. Define the passive testing dimension, $d_{passive} = d_{passive}(\mathcal{P}, D)$, as the largest $q \in \mathbb{N}$ such that

$$\sup_{\pi \in \Pi_0} \sup_{\pi' \in \Pi_\epsilon} \Pr_{S \sim D^q}\big(d_S(\pi, \pi') > 1/4\big) \le 1/4.$$

That is, there exist distributions $\pi \in \Pi_0$ and $\pi' \in \Pi_\epsilon$ such that a random set $S$ of $d_{passive}$ examples has a
reasonable probability (at least 3/4) of having the property that one cannot reliably distinguish a random function
from $\pi$ versus a random function from $\pi'$ from just the labels of $S$. From the definition it is fairly immediate that
$\Omega(d_{passive})$ examples are necessary for passive testing; in fact, one can show that $O(d_{passive})$ are sufficient as well.

Theorem 6.2. The sample complexity of passive testing property $\mathcal{P}$ over distribution $D$ is $\Theta(d_{passive}(\mathcal{P}, D))$.

Proof. See Appendix 10 □ 



Connections to VC dimension. This notion of dimension brings out an interesting connection between learning
and testing. In particular, consider the special case that we simply wish to distinguish functions in $\mathcal{P}$ from truly
random functions, so $\pi'$ is the uniform distribution over all functions (this is indeed the form used by our lower
bound results in Sections 6.3 and 6.4). In that case, the passive testing dimension becomes the largest $q$ such that
for some (multi)set $F$ of functions $f_i \in \mathcal{P}$, a typical sample $S$ of size $q$ would have all $2^q$ possible labelings occur
approximately the same number of times over the functions $f_i \in F$. In contrast, the VC-dimension of $\mathcal{P}$ is the largest
$q$ such that for some sample $S$ of size $q$, each of the $2^q$ possible labelings occurs at least once. Notice there is a kind
of reversal of quantifiers here: in a distributional version of VC-dimension where one would like a "typical" set $S$ to
be shattered, the functions that induce the $2^q$ labelings could be different from sample to sample. However, for the
testing dimension, the set $F$ must be fixed in advance. That is the reason that it is possible for a tester to output "no"
even though the labels observed are still consistent with some function in $\mathcal{P}$.



6.2 Active Testing Dimension 

For the case of active testing, there are two complications. First, the algorithms can examine their entire $\mathrm{poly}(n)$-sized
unlabeled sample before deciding which points to query, and secondly they may in principle determine the
next query based on the responses to the previous ones (even though all our algorithmic results do not require this
feature). If we merely want to distinguish those properties that are actively testable with $O(1)$ queries from those
that are not, then the second complication disappears and the first is simplified as well, and the following coarse
notion of dimension suffices.

Definition 6.3. Define the coarse active testing dimension, $d_{coarse} = d_{coarse}(\mathcal{P}, D)$, as the largest $q \in \mathbb{N}$ such that

$$\sup_{\pi \in \Pi_0} \sup_{\pi' \in \Pi_\epsilon} \Pr_{S \sim D^q}\big(d_S(\pi, \pi') > 1/4\big) \le 1/n^q.$$

Theorem 6.4. If $d_{coarse}(\mathcal{P}, D) = O(1)$ then active testing of $\mathcal{P}$ over $D$ can be done with $O(1)$ queries, and if
$d_{coarse}(\mathcal{P}, D) = \omega(1)$ then it cannot.

Proof. See Appendix [0 □ 

To achieve a more fine-grained characterization of active testing we consider a slightly more involved quantity,
as follows. First, recall that given an unlabeled sample $U$ and distribution $\pi$ over functions, we define $\pi_U$ as the
induced distribution over labelings of $U$. We can view this as a distribution over unlabeled examples in $\{0, 1\}^{|U|}$.
Now, given two distributions over functions $\pi, \pi'$, define $\mathrm{Fair}(\pi, \pi', U)$ to be the distribution over labeled examples
$(y, \ell)$ defined as: with probability 1/2 choose $y \sim \pi_U$, $\ell = 1$ and with probability 1/2 choose $y \sim \pi'_U$, $\ell = 0$. Thus,
for a given unlabeled sample $U$, the sets $\Pi_0$ and $\Pi_\epsilon$ define a class of fair distributions over labeled examples. The
active testing dimension, roughly, asks how well this class can be approximated by the class of low-depth decision
trees. Specifically, let $\mathrm{DT}_k$ denote the class of decision trees of depth at most $k$. The active testing dimension for a
given number $u$ of allowed unlabeled examples is as follows:

Definition 6.5. Given a number $u = \mathrm{poly}(n)$ of allowed unlabeled examples, we define the active testing dimension,
$d_{active}(u) = d_{active}(\mathcal{P}, D, u)$, as the largest $q \in \mathbb{N}$ such that

$$\sup_{\pi \in \Pi_0} \sup_{\pi' \in \Pi_\epsilon} \Pr_{U \sim D^u}\big(\mathrm{err}^*(\mathrm{DT}_q, \mathrm{Fair}(\pi, \pi', U)) < 1/4\big) \le 1/4,$$

where $\mathrm{err}^*(H, P)$ is the error of the optimal function in $H$ with respect to data drawn from distribution $P$ over
labeled examples.

Theorem 6.6. Active testing of property $\mathcal{P}$ over distribution $D$ with failure probability $\frac{1}{8}$ using $u$ unlabeled examples
requires $\Omega(d_{active}(\mathcal{P}, D, u))$ label queries, and furthermore can be done with $O(u)$ unlabeled examples and
$O(d_{active}(\mathcal{P}, D, u))$ label queries.

Proof. See Appendix 10 □ 
We now use these notions of dimension to prove lower bounds for testing several properties. 



6.3 Application: Dictator functions 

We prove here that active testing of dictatorships over the uniform distribution requires $\Omega(\log n)$ queries by proving
an $\Omega(\log n)$ lower bound on $d_{active}(u)$ for any $u = \mathrm{poly}(n)$; in fact, this result holds even for the specific choice of
$\pi'$ as random noise (the uniform distribution over all functions).

Theorem 6.7. Active testing of dictatorships under the uniform distribution requires $\Omega(\log n)$ queries. This holds
even for distinguishing dictators from random functions.

Proof. Define $\pi$ and $\pi'$ to be uniform distributions over the dictator functions and over all boolean functions,
respectively. In particular, $\pi$ is the distribution obtained by choosing $i \in [n]$ uniformly at random and returning the
function $f : \{0, 1\}^n \to \{0, 1\}$ defined by $f(x) = x_i$. Fix $S$ to be a set of $q$ vectors in $\{0, 1\}^n$. This set can be
viewed as a $q \times n$ boolean-valued matrix. We write $c_1(S), \ldots, c_n(S)$ to represent the columns of this matrix. For
any $y \in \{0, 1\}^q$,

$$\pi_S(y) = \frac{|\{i \in [n] : c_i(S) = y\}|}{n} \quad \text{and} \quad \pi'_S(y) = 2^{-q}.$$

By Lemma B.1, to prove that $d_{active} \ge \frac{1}{2} \log n$, it suffices to show that when $q \le \frac{1}{2} \log n$ and $U$ is a set of $n^c$ vectors
chosen uniformly and independently at random from $\{0, 1\}^n$, then with probability at least $\frac{3}{4}$, every set $S \subseteq U$ of
size $|S| = q$ and every $y \in \{0, 1\}^q$ satisfy $\pi_S(y) \le \frac{6}{5} 2^{-q}$. (This is like a stronger version of $d_{coarse}$ where $d_S(\pi, \pi')$
is replaced with an $L_\infty$ distance.)

Consider a set $S$ of $q$ vectors chosen uniformly and independently at random from $\{0, 1\}^n$. For any vector
$y \in \{0, 1\}^q$, the expected number of columns of $S$ that are equal to $y$ is $n 2^{-q}$. Since the columns are drawn
independently at random, Chernoff bounds imply that

$$\Pr\Big[\, |\{i \in [n] : c_i(S) = y\}| > \tfrac{6}{5} n 2^{-q} \,\Big] \le e^{-\frac{1}{75} n 2^{-q}}.$$

By the union bound, the probability that there exists a vector $y \in \{0, 1\}^q$ such that more than $\frac{6}{5} n 2^{-q}$ columns of
$S$ are equal to $y$ is at most $2^q e^{-\frac{1}{75} n 2^{-q}}$. Furthermore, when $U$ is defined as above, we can apply the union bound
once again over all subsets $S \subseteq U$ of size $|S| = q$ (at most $n^{cq}$ of them when $|U| = n^c$) to obtain
$\Pr[\exists S, y : \pi_S(y) > \frac{6}{5} 2^{-q}] < n^{cq} \cdot 2^q \cdot e^{-\frac{1}{75} n 2^{-q}}$.
When $q \le \frac{1}{2} \log n$, we have $n 2^{-q} \ge \sqrt{n}$, so this probability is bounded above by $e^{O(\log^2 n) - \frac{1}{75} \sqrt{n}}$, which is less than $\frac{1}{4}$ when $n$ is large
enough, as we wanted to show. □
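
The quantity controlled in this proof is easy to compute for a concrete query set. The following sketch (with hypothetical helper names) evaluates $\pi_S$ for a uniformly random dictator by counting equal columns of $S$, and reports the largest ratio $\pi_S(y) / \pi'_S(y)$, which the proof shows stays below $\frac{6}{5}$ for every small $S \subseteq U$ with high probability.

```python
from collections import Counter


def dictator_labeling_distribution(S):
    """pi_S for a uniformly random dictator: pi_S(y) = |{i : c_i(S) = y}| / n.

    S is a list of q binary vectors of length n (the rows of a q x n matrix).
    """
    n = len(S[0])
    columns = Counter(tuple(row[i] for row in S) for i in range(n))
    return {y: count / n for y, count in columns.items()}


def max_ratio_to_uniform(S):
    """Largest pi_S(y) / pi'_S(y), where pi'_S(y) = 2^{-q} for random functions."""
    q = len(S)
    return max(dictator_labeling_distribution(S).values()) * 2 ** q
```

A ratio close to 1 means that the labels of a random dictator on $S$ look nearly like uniformly random labels, which is exactly the situation in which the $q$ queries carry little information.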
6.4 Application: LTFs

The testing dimension also lets us prove the lower bounds in Theorem 4.1 regarding the query complexity for testing
linear threshold functions. Specifically, those bounds follow directly from the following result.

Theorem 6.8. For linear threshold functions under the standard $n$-dimensional Gaussian distribution, $d_{passive} = \Omega(\sqrt{n/\log n})$
and $d_{active} = \Omega((n/\log n)^{1/3})$.

Let us give a brief overview of the strategies used to obtain the $d_{passive}$ and $d_{active}$ bounds. The complete proofs
for both results, as well as a simpler proof that $d_{coarse} = \Omega((n/\log n)^{1/3})$, can be found in Appendix F.4.

For both results, we set $\pi$ to be a distribution over LTFs obtained by choosing $w \sim \mathcal{N}(0, I_{n \times n})$ and outputting
$f(x) = \mathrm{sgn}(w \cdot x)$. Set $\pi'$ to be the uniform distribution over all functions, i.e., for any $x \in \mathbb{R}^n$, the value of $f(x)$
is uniformly drawn from $\{0, 1\}$ and is independent of the value of $f$ on other inputs.

To bound $d_{passive}$, we bound the total variation distance between the distribution of $Xw/\sqrt{n}$ given $X$, and a
normal $\mathcal{N}(0, I_{n \times n})$. If this distance is small, then so must be the distance between the distribution of $\mathrm{sgn}(Xw)$
and the uniform distribution over label sequences. In fact, we show this is the case for a broad family of product
distributions, characterized by a condition on the moments of the coordinate projections.

Our strategy for bounding $d_{active}$ is very similar to the one we used to prove the lower bound on the query
complexity for testing dictator functions in the last section. Again, we want to apply Lemma B.1. Specifically,
we want to show that when $q \le o((n/\log n)^{1/3})$ and $U$ is a set of vectors drawn independently from the
$n$-dimensional standard Gaussian distribution, then with probability at least $\frac{3}{4}$, every set $S \subseteq U$ of size $|S| = q$ and
almost all $x \in \mathbb{R}^q$, we have $\pi_S(x) \le \frac{6}{5} 2^{-q}$. The difference between this case and the lower bound for dictator
functions is that we now rely on strong concentration bounds on the spectrum of random matrices [64] to obtain the
desired inequality.

7 Conclusions 

In this work we develop and analyze a model of property testing that parallels the active learning model in machine 
learning, in which queries are restricted to be selected from a given (polynomially) large unlabeled sample. We 
demonstrate that a number of important properties for machine learning can be efficiently tested in this setting with 
substantially fewer queries than needed to learn. These testing algorithms bring together tools from a range of 
areas including U-statistics, noise-sensitivity, and self-correction, and develop characterizations of certain function 
classes that may be of independent interest. We additionally give a combination result allowing one to build testable 
properties out of others, as well as develop notions of intrinsic testing dimension that characterize the number of 
queries needed to test, and which we then use to prove a number of near-matching lower bounds. In the context of 
testing linear separators, for the active testing model we have an $\tilde{O}(\sqrt{n})$ upper bound and an $\tilde{\Omega}(n^{1/3})$ lower bound;
it would be very exciting if the upper bound could be improved, but either way it would be interesting to close that
gap. Additionally, testing of linear separators over more general distributions would be quite interesting.
A Comparison of Active Testing and Other Property Testing Models

In this section, we compare the active testing model with four existing models of property testing: the standard
property testing model as introduced by Rubinfeld and Sudan [57], the passive testing model first studied by Goldreich,
Goldwasser, and Ron [32], the tolerant property testing model introduced by Parnas, Ron, and Rubinfeld [54], and
the distribution-free property testing model of Halevy and Kushilevitz [35].

A.1 Standard and Passive Property Testing

Fix some sets $X, Y$ and let $\mathcal{P}$ be some property of functions $f : X \to Y$. Let $D$ be some distribution over $X$. Recall
that the standard model of property testing is defined as follows.

Definition A.1 (Standard Property Tester [57]). A $q$-query (standard) $\epsilon$-tester for $\mathcal{P}$ over the distribution $D$ is a
randomized algorithm $A$ that queries the value of a function $f$ on $q$ of its inputs and then

1. Accepts with probability at least $\frac{2}{3}$ when $f \in \mathcal{P}$, and

2. Rejects with probability at least $\frac{2}{3}$ when $d_D(f, \mathcal{P}) \ge \epsilon$.

The most commonly-studied case is where the distribution $D$ is uniform over the domain of the function. When
that is not the case, note that we can assume that the tester knows the distribution $D$. For the alternate model where
the tester does not know $D$, see Section A.3.

The passive property testing model is similar to the standard property testing model, except that the queries made
by the tester in this model are drawn at random from $D$.

Definition A.2 (Passive Property Tester [32]). A $q$-query passive $\epsilon$-tester for $\mathcal{P}$ over the distribution $D$ is a
randomized algorithm $A$ that draws $q$ samples independently at random from $D$, queries the value of a function $f$ on each
of these $q$ samples, and then

1. Accepts with probability at least $\frac{2}{3}$ when $f \in \mathcal{P}$, and

2. Rejects with probability at least $\frac{2}{3}$ when $d_D(f, \mathcal{P}) \ge \epsilon$.
The query complexity of a property under a given testing model is the minimum query complexity of any tester
for the property in this model. We denote the query complexity of properties in the standard, passive, and active
testing models with the following notation.

Definition A.3 (Query complexity). The query complexity of $\mathcal{P}$ over $D$ in the standard property testing model is

$$Q_{D,\epsilon}(\mathcal{P}) := \min\{q > 0 : \text{there exists a } q\text{-query } \epsilon\text{-tester for } \mathcal{P}\}.$$

Similarly, the query complexity of $\mathcal{P}$ over $D$ in the active and passive testing models is

$$Q^a_{D,\epsilon}(\mathcal{P}) := \min\{q > 0 : \text{there exists a } q\text{-query active } \epsilon\text{-tester for } \mathcal{P}\},$$

$$Q^p_{D,\epsilon}(\mathcal{P}) := \min\{q > 0 : \text{there exists a } q\text{-query passive } \epsilon\text{-tester for } \mathcal{P}\}.$$

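With this notation in place, we can now formally establish the relationship between the standard, active, and
passive models of property testing.

Theorem A.4. For every property $\mathcal{P}$, every distribution $D$, and every $\epsilon > 0$,

$$Q_{D,\epsilon}(\mathcal{P}) \le Q^a_{D,\epsilon}(\mathcal{P}) \le Q^p_{D,\epsilon}(\mathcal{P}). \qquad (1)$$

Furthermore, the three testing models are distinct: there exist properties $\mathcal{P}$, distributions $D$, and constants $\epsilon > 0$
such that $Q_{D,\epsilon}(\mathcal{P}) < Q^a_{D,\epsilon}(\mathcal{P})$, and there also exist $\mathcal{P}, D, \epsilon$ such that $Q^a_{D,\epsilon}(\mathcal{P}) < Q^p_{D,\epsilon}(\mathcal{P})$.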

Proof. Both inequalities in (1) are obtained with simple arguments. For the first inequality, note that we can always
simulate an active tester in the standard property testing model by internally sampling† a random subset of the
inputs in the domain of the function $f$ and having the active tester select from those inputs. The second inequality
follows from the fact that we can simulate a passive tester in the active testing model by querying the function on
the first $Q^p_{D,\epsilon}(\mathcal{P})$ samples drawn at random from $D$.

The distinctness of the three models of property testing is not as immediate, but it follows from the main results
in our paper. Theorem 6.7 shows that testing dictatorship in the active testing model requires $\Omega(\log n)$ queries. The
same property can be tested with $O(1/\epsilon)$ queries in the standard testing model [8, 55], so this establishes the first
strict inequality. For the second strict inequality, consider the property of being a union of $d$ intervals. Theorem 3.1
shows that we can test this property with $\mathrm{poly}(1/\epsilon)$ queries in the active testing model but $\Omega(\sqrt{d})$ queries are required
to test the same property in the passive model BTI . □

A.2 Tolerant Testing 

The tolerant property testing model is an extension of the standard model of property testing with one extra
requirement: the tester must accept functions with a given property $\mathcal{P}$ as well as functions that are (very) close to $\mathcal{P}$.
Formally, the model is defined as follows.

Definition A.5 (Tolerant Property Tester [54]). Fix $0 \le \epsilon_1 < \epsilon_2 \le 1$. A $q$-query tolerant $(\epsilon_1, \epsilon_2)$-tester for $\mathcal{P}$ over
the distribution $D$ is a randomized algorithm $A$ that queries the value of a function $f$ on $q$ of its inputs and then

1. Accepts with probability at least $\frac{2}{3}$ when $d_D(f, \mathcal{P}) \le \epsilon_1$, and

2. Rejects with probability at least $\frac{2}{3}$ when $d_D(f, \mathcal{P}) \ge \epsilon_2$.

Definition A.6. The query complexity of $\mathcal{P}$ over $D$ in the tolerant property testing model is

$$Q^{tol}_{D,\epsilon_1,\epsilon_2}(\mathcal{P}) := \min\{q > 0 : \text{there exists a } q\text{-query tolerant } (\epsilon_1, \epsilon_2)\text{-tester for } \mathcal{P}\}.$$

† Note that here we use the fact that a standard property tester knows the underlying distribution $D$ and can therefore generate samples
from this distribution "for free".
One may ask whether every property that has a query-efficient tolerant tester also has a query-efficient tester in
the active model. Our lower bound on the query complexity for testing dictator functions in the active model gives
a negative answer to this question: there are properties that require significantly more queries to test in the active
model than in the tolerant testing model.

Theorem A.7. There exist $\mathcal{P}$, $D$, and $0 \le \epsilon_1 < \epsilon_2 \le 1$ for which $Q^{tol}_{D,\epsilon_1,\epsilon_2}(\mathcal{P}) < Q^a_{D,\epsilon_2}(\mathcal{P})$.

Proof. Consider the property $\mathcal{P}$ of being a dictator function and let $D$ be the uniform distribution over the hypercube.
Theorem 6.7 shows that $Q^a_{D,\epsilon_2}(\mathcal{P}) = \Omega(\log n)$. By contrast, standard testers for dictator functions [8, 55] are
tolerant $(\epsilon_1, \epsilon_2)$-testers with query complexity $\mathrm{poly}(1/(\epsilon_2 - \epsilon_1))$, so the inequality in the theorem statement holds
when $\epsilon_2 - \epsilon_1 = \Theta(1)$. □

We believe that the tolerant and active property testing models are incomparable, i.e., that there exist properties
$\mathcal{P}$ (along with distributions $D$ and parameters $\epsilon_1 < \epsilon_2$) for which the inequality in Theorem A.7 is reversed and
$Q^{tol}_{D,\epsilon_1,\epsilon_2}(\mathcal{P}) > Q^a_{D,\epsilon_2}(\mathcal{P})$. We leave the proof (or disproof) of this assertion as an open problem.

A.3 Distribution-free testing 

In the above property testing models, the tester knows the underlying distribution $D$. To model the scenario where
the tester does not know $D$, Halevy and Kushilevitz [35] introduced the distribution-free testing model. (See also [33]
[MIEIHSI.) The model is defined formally as follows.

Definition A.8 (Distribution-free Tester [35]). An $s$-sample, $q$-query distribution-free $\epsilon$-tester for $\mathcal{P}$ is a
randomized algorithm $A$ that draws $s$ independent samples from the (unknown) distribution $D$, queries the value of the
(unknown) function $f$ on those $s$ samples and $q - s$ additional inputs of its choosing, and then

1. Accepts with probability at least $\frac{2}{3}$ when $f \in \mathcal{P}$, and

2. Rejects with probability at least $\frac{2}{3}$ when $d_D(f, \mathcal{P}) \ge \epsilon$.

Definition A.9. The query complexity of the property $\mathcal{P}$ in the distribution-free model is

$$Q^{df}_\epsilon(\mathcal{P}) := \min\{q > 0 : \text{for some } 0 \le s \le q, \text{ there exists an } s\text{-sample, } q\text{-query distribution-free } \epsilon\text{-tester for } \mathcal{P}\}.$$

Superficially, the distribution-free and active testing models appear to be similar: in both models, the tester first 
samples the underlying distribution D and then queries the value of the function on some inputs. The challenges in 
the two models, however, are mostly orthogonal and, as a result, the two models of property testing are incomparable. 
This statement is made precise by the following two results. 

Theorem A.10. There exist properties $\mathcal{P}$ such that for every distribution $D$ and every large enough constant $\epsilon > 0$,

$$Q^a_{D,\epsilon}(\mathcal{P}) < Q^{df}_\epsilon(\mathcal{P}).$$

Proof. Fix a large enough $d > 0$ and let $\mathcal{P}$ be the property consisting of the set of unions of $d$ intervals. Theorem 3.1
shows that for every distribution $D$, we have $Q^a_{D,\epsilon}(\mathcal{P}) = \mathrm{poly}(1/\epsilon)$. To complete the proof of the theorem, we now
show that $Q^{df}_\epsilon(\mathcal{P}) = \Omega(\sqrt{d})$.

Consider the following two distributions on pairs of functions $f : [0, 1] \to \{0, 1\}$ and distributions $D$ on $[0, 1]$.
For the distribution $\mathcal{F}_{yes}$, choose a set $S$ of $d$ points sampled independently and uniformly at random from $[0, 1]$.
Define $D$ to be the uniform distribution over $S$, and let $f : [0, 1] \to \{0, 1\}$ be a random function defined by choosing
$f(x)$ uniformly at random for every $x \in S$ and setting $f(x) = 0$ for all $x \in [0, 1] \setminus S$. Clearly, every such function
$f$ is a union of $d$ intervals.

The distribution $\mathcal{F}_{no}$ is defined similarly except that in this case we let $S$ be a set of $10d$ points. We define $D$
to be uniform over $S$ and again define $f : [0, 1] \to \{0, 1\}$ by choosing $f(x)$ uniformly at random for all $x \in S$ and
setting $f(x) = 0$ for all remaining points. In this case, with high probability the resulting functions are far from unions of $d$ intervals
over $D$.

Let $A$ be a distribution-free tester for unions of $d$ intervals. The tester $A$ must accept with high probability
when we draw a function $f$ and distribution $D$ from $\mathcal{F}_{yes}$ and it must reject with high probability when instead we
draw a function and distribution from $\mathcal{F}_{no}$. Clearly, querying the functions on points that were not drawn from the
distribution $D$ will not help $A$ since with probability 1 it will observe $f(x) = 0$ on those points. Assume now that
$A$ makes $s = o(\sqrt{d})$ draws to the distribution $D$. By the birthday paradox, with probability $1 - o(1)$, the $s$ samples
drawn from the distribution are distinct. In this case, the distributions on the values of the function on those $s$ inputs
are uniformly random so it has no way to distinguish whether the input was drawn from $\mathcal{F}_{yes}$ or from $\mathcal{F}_{no}$. This
contradicts the assumption that $A$ is a valid distribution-free tester for unions of $d$ intervals and completes the proof
of the lower bound on $Q^{df}_\epsilon(\mathcal{P})$. □
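
The birthday-paradox step can be made quantitative with the exact collision-free probability $\prod_{i<s} (1 - i/d)$, which is at least $1 - s^2/(2d)$. The helper below is a small illustration (the function name is an assumption made for the example) of how this probability stays close to 1 as long as $s$ is well below $\sqrt{d}$.

```python
def prob_all_distinct(s, d):
    """Probability that s independent uniform draws from d points are pairwise distinct."""
    p = 1.0
    for i in range(s):
        p *= (d - i) / d
    return p


if __name__ == "__main__":
    d = 10 ** 6
    for s in (10, 100, 1000, 10000):
        print(s, prob_all_distinct(s, d))
```

For $d = 10^6$, the printed values show the transition around $s \approx \sqrt{d} = 1000$: far below it collisions are rare, and far above it they are nearly certain.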

Theorem A.11. There exist properties $\mathcal{P}$, distributions $D$, and parameters $\epsilon > 0$ such that $Q^{df}_\epsilon(\mathcal{P}) < Q^a_{D,\epsilon}(\mathcal{P})$.

Proof. Let $\mathcal{P}$ be the property of being a dictator function, let $D$ be the uniform distribution over the hypercube, and
let $\epsilon > 0$ be some constant. Theorem 6.7 shows that $Q^a_{D,\epsilon}(\mathcal{P}) = \Omega(\log n)$. By contrast, Halevy and Kushilevitz [35]
showed that it is possible to test dictator functions in the distribution-free model with a constant number of queries
when $\epsilon$ is constant and so $Q^{df}_\epsilon(\mathcal{P}) = O(1)$. □



B Proof of a Property Testing Lemma 

The following lemma is a generalization of a lemma that is widely used for proving lower bounds in property
testing [27, Lem. 8.3]. We use this lemma to prove the lower bounds on the query complexity for testing dictator
functions and testing linear threshold functions.

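Lemma B.1. Let $\pi$ and $\pi'$ be two distributions on functions $X \to \mathbb{R}$. Fix $U \subseteq X$ to be a set of allowable queries.
Suppose that for any $S \subseteq U$, $|S| = q$, there is a set $E_S \subseteq \mathbb{R}^q$ (possibly empty) satisfying $\pi_S(E_S) \le \frac{1}{5} 2^{-q}$ such that

$$\pi_S(y) < \tfrac{6}{5} \pi'_S(y) \quad \text{for every } y \in \mathbb{R}^q \setminus E_S.$$

Then $\mathrm{err}^*(\mathrm{DT}_q, \mathrm{Fair}(\pi, \pi', U)) > 1/4$.

Proof. Consider any decision tree $A$ of depth $q$. Each internal node of the tree consists of a query $y \in U$ and a
subset $T \subseteq \mathbb{R}$ such that its children are labeled by $T$ and $\mathbb{R} \setminus T$, respectively. The leaves of the tree are labeled with
either "accept" or "reject", and let $L$ be the set of leaves labeled as accept. Each leaf $\ell \in L$ corresponds to a set
$S_\ell \subseteq U$ of queries and a subset $T_\ell \subseteq \mathbb{R}^q$, where $f : X \to \mathbb{R}$ leads to the leaf $\ell$ iff $f(S_\ell) \in T_\ell$. The probability that
$A$ (correctly) accepts an input drawn from $\pi$ is

$$a_1 = \sum_{\ell \in L} \int_{T_\ell} \pi_{S_\ell}(y)\, dy.$$

Similarly, the probability that $A$ (incorrectly) accepts an input drawn from $\pi'$ is

$$a_2 = \sum_{\ell \in L} \int_{T_\ell} \pi'_{S_\ell}(y)\, dy.$$

The difference between the two acceptance probabilities is bounded above by

$$a_1 - a_2 \le \sum_{\ell \in L} \left( \int_{T_\ell \setminus E_{S_\ell}} \big(\pi_{S_\ell}(y) - \pi'_{S_\ell}(y)\big)\, dy + \int_{T_\ell \cap E_{S_\ell}} \pi_{S_\ell}(y)\, dy \right).$$

The conditions in the statement of the lemma then imply that

$$a_1 - a_2 \le \tfrac{1}{5} \sum_{\ell \in L} \int_{T_\ell} \pi'_{S_\ell}(y)\, dy + \sum_{\ell \in L} \pi_{S_\ell}(E_{S_\ell}) \le \tfrac{1}{5} + 2^q \cdot \tfrac{1}{5} 2^{-q} = \tfrac{2}{5},$$

where we used that a tree of depth $q$ has at most $2^q$ leaves.

To complete the proof, we note that $A$ errs on an input drawn from $\mathrm{Fair}(\pi, \pi', U)$ with probability

$$\tfrac{1}{2}(1 - a_1) + \tfrac{1}{2} a_2 = \tfrac{1}{2} - \tfrac{1}{2}(a_1 - a_2) > \tfrac{1}{4}. \qquad □$$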

C Proofs for Testing Unions of Intervals 

In this section we complete the proofs of the technical results in Section [3j 

Proposition 3.3 (Restated). Fix $\delta > 0$ and let $f : [0, 1] \to \{0, 1\}$ be a union of $d$ intervals. Then $\mathrm{NS}_\delta(f) \le d\delta$.

Proof. For any fixed $b \in [0, 1]$, the probability that $x < b < y$ when $x \sim U(0, 1)$ and $y \sim U(x - \delta, x + \delta)$ is

$$\Pr_{x,y}[x < b < y] = \int_0^{\delta} \Pr_y[y \ge b \mid x = b - t]\, dt = \int_0^{\delta} \frac{\delta - t}{2\delta}\, dt = \frac{\delta}{4}.$$

Similarly, $\Pr_{x,y}[y < b < x] = \frac{\delta}{4}$. So the probability that $b$ lies between $x$ and $y$ is at most $\frac{\delta}{2}$.

When $f$ is the union of $d$ intervals, $f(x) \ne f(y)$ only if at least one of the boundaries $b_1, \ldots, b_{2d}$ of the intervals of $f$ lies in between $x$ and $y$. So by the union bound, $\Pr[f(x) \ne f(y)] \le 2d(\delta/2) = d\delta$. Note that if $b$ is within distance $\delta$ of 0 or 1, the probability is only lower. □
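The bound in Proposition 3.3 can be checked empirically. The following is a minimal Monte Carlo sketch, not part of the original analysis: the helper names `union_of_intervals` and `estimate_noise_sensitivity`, the specific boundaries, and the sample size are illustrative assumptions. Points falling outside $[0, 1]$ are given label 0, which is harmless here because all boundaries are far from the endpoints.

```python
import random

def union_of_intervals(boundaries):
    """Indicator of [b_1, b_2] U [b_3, b_4] U ... for sorted boundaries b_1 < ... < b_2d."""
    pairs = list(zip(boundaries[::2], boundaries[1::2]))
    def f(x):
        return int(any(lo <= x <= hi for lo, hi in pairs))
    return f

def estimate_noise_sensitivity(f, delta, trials, rng):
    """Monte Carlo estimate of Pr[f(x) != f(y)], x ~ U(0,1), y ~ U(x - delta, x + delta)."""
    disagreements = 0
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0)
        y = rng.uniform(x - delta, x + delta)
        if f(x) != f(y):
            disagreements += 1
    return disagreements / trials

rng = random.Random(0)          # fixed seed (assumption) so the run is reproducible
d, delta = 3, 0.02
boundaries = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]   # d = 3 well-separated intervals
ns = estimate_noise_sensitivity(union_of_intervals(boundaries), delta, 100_000, rng)
print(f"estimated NS = {ns:.4f}; bound d * delta = {d * delta:.4f}")
```

Because every boundary here is at distance more than $\delta$ from the endpoints and from the other boundaries, each contributes exactly $\delta/2$, so the estimate should sit close to $d\delta$ itself (up to sampling noise) rather than strictly below it.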

Lemma 3.4 (Restated). Fix $\delta = \frac{\epsilon^2}{32d}$. Let $f : [0,1] \to \{0,1\}$ be any function with noise sensitivity $\mathrm{NS}_\delta(f) \le d\delta(1 + \frac{\epsilon}{4})$. Then $f$ is $\epsilon$-close to a union of $d$ intervals.

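Proof. The proof proceeds in two steps: We first show that $f$ is $\frac{\epsilon}{2}$-close to a union of $d(1 + \frac{\epsilon}{2})$ intervals, then we show that every union of $d(1 + \frac{\epsilon}{2})$ intervals is $\frac{\epsilon}{2}$-close to a union of $d$ intervals.

Consider the "smoothed" function $f_\delta : [0, 1] \to [0, 1]$ defined by

$$f_\delta(x) = \mathbb{E}_{y \sim_\delta x} f(y) = \frac{1}{2\delta} \int_{x-\delta}^{x+\delta} f(y)\, dy.$$

The function $f_\delta$ is the convolution of $f$ and the uniform kernel $\phi : \mathbb{R} \to [0, 1]$ defined by $\phi(x) = \frac{1}{2\delta} \mathbf{1}[|x| \le \delta]$.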

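Fix $\tau = \frac{4}{\epsilon} \mathrm{NS}_\delta(f)$. We introduce the function $g^* : [0, 1] \to \{0, 1, *\}$ by setting

$$g^*(x) = \begin{cases} 1 & \text{when } f_\delta(x) \ge 1 - \tau, \\ 0 & \text{when } f_\delta(x) \le \tau, \text{ and} \\ * & \text{otherwise} \end{cases}$$

for all $x \in [0, 1]$. Finally, we define $g : [0, 1] \to \{0, 1\}$ by setting $g(x) = g^*(y)$ where $y \le x$ is the largest value for which $g^*(y) \ne *$. (If no such $y$ exists, we fix $g(x) = 0$.)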

We first claim that $\mathrm{dist}(f, g) \le \frac{\epsilon}{2}$. To see this, note that

$$\mathrm{dist}(f, g) = \Pr_x[f(x) \ne g(x)] \le \Pr_x[g^*(x) = *] + \Pr_x[f(x) = 0 \wedge g^*(x) = 1] + \Pr_x[f(x) = 1 \wedge g^*(x) = 0]$$
$$= \Pr_x[\tau < f_\delta(x) < 1 - \tau] + \Pr_x[f(x) = 0 \wedge f_\delta(x) \ge 1 - \tau] + \Pr_x[f(x) = 1 \wedge f_\delta(x) \le \tau].$$

We bound the three terms on the RHS individually. For the first term, we observe that $\mathrm{NS}_\delta(f, x) = \min\{f_\delta(x), 1 - f_\delta(x)\}$ and that $\mathbb{E}_x \mathrm{NS}_\delta(f, x) = \mathrm{NS}_\delta(f)$. From these identities and Markov's inequality, we have that

$$\Pr_x[\tau < f_\delta(x) < 1 - \tau] = \Pr_x[\mathrm{NS}_\delta(f, x) > \tau] < \frac{\mathrm{NS}_\delta(f)}{\tau} = \frac{\epsilon}{4}.$$

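For the second term, let $S \subseteq [0, 1]$ denote the set of points $x$ where $f(x) = 0$ and $f_\delta(x) \ge 1 - \tau$. Let $\Gamma \subseteq S$ represent a $\delta$-net of $S$. Clearly, $|\Gamma| \le \frac{1}{\delta}$. For $x \in \Gamma$, let $B_x = (x - \delta, x + \delta)$ be a ball of radius $\delta$ around $x$. Since $f_\delta(x) \ge 1 - \tau$, the intersection of $S$ and $B_x$ has mass at most $|S \cap B_x| \le \tau\delta$. Therefore, the total mass of $S$ is at most $|S| \le |\Gamma| \tau \delta = \tau$. By the bounds on the noise sensitivity of $f$ in the lemma's statement, we therefore have

$$\Pr_x[f(x) = 0 \wedge f_\delta(x) \ge 1 - \tau] \le \tau \le \frac{\epsilon}{8}.$$

Similarly, we obtain the same bound on the third term. As a result, $\mathrm{dist}(f, g) \le \frac{\epsilon}{4} + \frac{\epsilon}{8} + \frac{\epsilon}{8} = \frac{\epsilon}{2}$, as we wanted to show.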

We now want to show that $g$ is a union of $m \le d(1 + \frac{\epsilon}{2})$ intervals. Each left boundary of an interval in $g$ occurs at a point $x \in [0, 1]$ where $g^*(x) = *$, where the maximum $y < x$ such that $g^*(y) \ne *$ takes the value $g^*(y) = 0$, and where the minimum $z > x$ such that $g^*(z) \ne *$ has the value $g^*(z) = 1$. In other words, for each left boundary of an interval in $g$, there exists an interval $(y, z)$ such that $f_\delta(y) \le \tau$, $f_\delta(z) \ge 1 - \tau$, and for each $y < x < z$, $f_\delta(x) \in (\tau, 1 - \tau)$. Fix any interval $(y, z)$. Since $f_\delta$ is the convolution of $f$ with a uniform kernel of width $2\delta$, it is Lipschitz continuous (with Lipschitz constant $\frac{1}{2\delta}$). So there exists $x \in (y, z)$ such that the conditions $f_\delta(x) = \frac{1}{2}$, $x - y \ge 2\delta(\frac{1}{2} - \tau)$, and $z - x \ge 2\delta(\frac{1}{2} - \tau)$ all hold. As a result,

$$\int_y^z \mathrm{NS}_\delta(f, t)\, dt = \int_y^x \mathrm{NS}_\delta(f, t)\, dt + \int_x^z \mathrm{NS}_\delta(f, t)\, dt \ge 2\delta\left(\tfrac{1}{2} - \tau\right)^2.$$

Similarly, for each right boundary of an interval in $g$, we have an interval $(y, z)$ such that

$$\int_y^z \mathrm{NS}_\delta(f, t)\, dt \ge 2\delta\left(\tfrac{1}{2} - \tau\right)^2.$$

The intervals $(y, z)$ for the left and right boundaries are all disjoint, so

$$\mathrm{NS}_\delta(f) \ge \sum_{i=1}^{2m} \int_{y_i}^{z_i} \mathrm{NS}_\delta(f, t)\, dt \ge 2m \frac{\delta}{2} (1 - 2\tau)^2.$$

This means that

$$m \le \frac{d\delta(1 + \epsilon/4)}{\delta(1 - 2\tau)^2}$$

and $g$ is a union of at most $d(1 + \frac{\epsilon}{2})$ intervals, as we wanted to show.

Finally, we want to show that any function that is the union of $m \le d(1 + \frac{\epsilon}{2})$ intervals is $\frac{\epsilon}{2}$-close to a union of $d$ intervals. Let $\ell_1, \ldots, \ell_m$ represent the lengths of the intervals in $g$. Clearly, $\ell_1 + \cdots + \ell_m \le 1$, so there must be a set $S$ of $m - d \le d\epsilon/2$ intervals in $g$ with total length

$$\sum_{i \in S} \ell_i \le \frac{m - d}{m} \le \frac{d\epsilon/2}{d(1 + \frac{\epsilon}{2})} \le \frac{\epsilon}{2}.$$

Consider the function $h : [0, 1] \to \{0, 1\}$ obtained by removing the intervals in $S$ from $g$ (i.e., by setting $h(x) = 0$ for the values $x \in [b_{2i-1}, b_{2i}]$ for some $i \in S$). The function $h$ is a union of $d$ intervals and $\mathrm{dist}(g, h) \le \frac{\epsilon}{2}$. This completes the proof, since $\mathrm{dist}(f, h) \le \mathrm{dist}(f, g) + \mathrm{dist}(g, h) \le \epsilon$. □
D Proofs for Testing LTFs

We complete the proof that LTFs can be tested with $O(\sqrt{n})$ samples in this section.



D.1 Proof of Lemma 4.3

The proof of Lemma 4.3 uses the Hermite decomposition of functions. We begin by introducing this notion and related definitions.

Definition D.1. The Hermite polynomials are a set of polynomials $h_0(x) = 1$, $h_1(x) = x$, $h_2(x) = \frac{1}{\sqrt{2}}(x^2 - 1), \ldots$ that form a complete orthogonal basis for (square-integrable) functions $f : \mathbb{R} \to \mathbb{R}$ over the inner product space defined by the inner product $\langle f, g \rangle = \mathbb{E}_x[f(x)g(x)]$, where the expectation is over the standard Gaussian distribution $\mathcal{N}(0, 1)$.

Definition D.2. For any $S \in \mathbb{N}^n$, define $H_S = \prod_{i=1}^n h_{S_i}(x_i)$. The Hermite coefficient of $f : \mathbb{R}^n \to \mathbb{R}$ corresponding to $S$ is $\hat{f}(S) = \langle f, H_S \rangle = \mathbb{E}_x[f(x)H_S(x)]$ and the Hermite decomposition of $f$ is $f(x) = \sum_{S \in \mathbb{N}^n} \hat{f}(S)H_S(x)$. The degree of the coefficient $\hat{f}(S)$ is $|S| := \sum_{i=1}^n S_i$.

The connection between linear threshold functions and the Hermite decomposition of functions is revealed by the following key lemma of Matulef et al. [49].

Lemma D.3 (Matulef et al. [49]). There is an explicit continuous function $W : \mathbb{R} \to \mathbb{R}$ with bounded derivative $\|W'\|_\infty \le 1$ and peak value $W(0) = \frac{2}{\pi}$ such that every linear threshold function $f : \mathbb{R}^n \to \{-1, 1\}$ satisfies $\sum_{i=1}^n \hat{f}(e_i)^2 = W(\mathbb{E}_x f)$. Moreover, every function $g : \mathbb{R}^n \to \{-1, 1\}$ that satisfies $\left|\sum_{i=1}^n \hat{g}(e_i)^2 - W(\mathbb{E}_x g)\right| \le 4\epsilon^3$ is $\epsilon$-close to being a linear threshold function.

In other words, Lemma D.3 shows that $\sum_i \hat{f}(e_i)^2$ characterizes linear threshold functions. To obtain Lemma 4.3 it suffices to show that this sum is equivalent to $\mathbb{E}_{x,y}[f(x)f(y)\langle x, y \rangle]$. This identity is easily obtained:

Lemma D.4. For any function $f : \mathbb{R}^n \to \mathbb{R}$, we have $\sum_{i=1}^n \hat{f}(e_i)^2 = \mathbb{E}_{x,y}[f(x)f(y)\langle x, y \rangle]$.
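Proof. Applying the Hermite decomposition of $f$ and linearity of expectation,

$$\mathbb{E}_{x,y}[f(x)f(y)\langle x, y \rangle] = \sum_{i=1}^n \sum_{S,T \in \mathbb{N}^n} \hat{f}(S)\hat{f}(T)\, \mathbb{E}_x[H_S(x)x_i]\, \mathbb{E}_y[H_T(y)y_i].$$

By definition, $x_i = h_1(x_i) = H_{e_i}(x)$. The orthonormality of the Hermite polynomials therefore guarantees that $\mathbb{E}_x[H_S(x)H_{e_i}(x)] = \mathbf{1}[S = e_i]$. Similarly, $\mathbb{E}_y[H_T(y)y_i] = \mathbf{1}[T = e_i]$. □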
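As a sanity check on Lemma D.4, both sides of the identity can be estimated by sampling. The sketch below is illustrative only: the function names, the weight vector, and the sample sizes are assumptions, and the convention $\mathrm{sgn}(0) = +1$ is chosen arbitrarily. For a linear threshold function through the origin, $\hat{f}(e_i) = \mathbb{E}[f(x)x_i] = \sqrt{2/\pi}\, w_i / \|w\|$, so both estimates should be near $2/\pi$, matching $W(0)$ in Lemma D.3.

```python
import math
import random

def ltf(w):
    """f(x) = sgn(<w, x>), with sgn(0) taken to be +1 (a convention of this sketch)."""
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def gaussian_vector(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def estimate_degree_one_weight(f, n, samples, rng):
    """Estimate sum_i fhat(e_i)^2, where fhat(e_i) = E[f(x) x_i]."""
    sums = [0.0] * n
    for _ in range(samples):
        x = gaussian_vector(n, rng)
        fx = f(x)
        for i in range(n):
            sums[i] += fx * x[i]
    return sum((s / samples) ** 2 for s in sums)

def estimate_pair_statistic(f, n, pairs, rng):
    """Estimate E_{x,y}[f(x) f(y) <x, y>] from independent pairs (x, y)."""
    total = 0.0
    for _ in range(pairs):
        x, y = gaussian_vector(n, rng), gaussian_vector(n, rng)
        total += f(x) * f(y) * sum(a * b for a, b in zip(x, y))
    return total / pairs

rng = random.Random(1)                 # fixed seed (assumption) for reproducibility
n = 4
f = ltf([1.0, -2.0, 0.5, 3.0])         # arbitrary illustrative weight vector
lhs = estimate_degree_one_weight(f, n, 40_000, rng)
rhs = estimate_pair_statistic(f, n, 40_000, rng)
print(f"sum_i fhat(e_i)^2 ~ {lhs:.3f}, E[f(x)f(y)<x,y>] ~ {rhs:.3f}, 2/pi = {2 / math.pi:.3f}")
```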

D.2 Analysis of LTF Tester 

We now complete the analysis of the LTF TESTER algorithm. 

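For a fixed function $f : \mathbb{R}^n \to \mathbb{R}$, define $g : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ to be $g(x, y) = f(x)f(y)\langle x, y \rangle$. Let $g^* : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be the truncation of $g$ defined by setting

$$g^*(x, y) = \begin{cases} f(x)f(y)\langle x, y \rangle & \text{if } |\langle x, y \rangle| \le \sqrt{4n \log(4n/\epsilon^3)} \\ 0 & \text{otherwise.} \end{cases}$$

Our goal is to estimate $\mathbb{E} g$. The following lemma shows that $\mathbb{E} g^*$ provides a good estimate of this value.

Lemma D.5. Let $g, g^* : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ be defined as above. Then $|\mathbb{E} g - \mathbb{E} g^*| \le \frac{1}{2}\epsilon^3$.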
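Proof. For notational clarity, fix $\tau = \sqrt{4n \log(4n/\epsilon^3)}$. By the definition of $g$ and $g^*$ and with the trivial bound $|f(x)f(y)\langle x, y \rangle| \le n$ we have

$$|\mathbb{E} g - \mathbb{E} g^*| = \Pr_{x,y}\big[|\langle x, y \rangle| > \tau\big] \cdot \Big|\mathbb{E}_{x,y}\big[f(x)f(y)\langle x, y \rangle \,\big|\, |\langle x, y \rangle| > \tau\big]\Big| \le n \cdot \Pr_{x,y}\big[|\langle x, y \rangle| > \tau\big].$$

The right-most term can be bounded with a standard Chernoff argument. By Markov's inequality and the independence of the variables $x_1, \ldots, x_n, y_1, \ldots, y_n$,

$$\Pr_{x,y}[\langle x, y \rangle > \tau] = \Pr_{x,y}\big[e^{t\langle x, y \rangle} > e^{t\tau}\big] \le \frac{\mathbb{E}\, e^{t\langle x, y \rangle}}{e^{t\tau}} = \frac{\prod_{i=1}^n \mathbb{E}\, e^{t x_i y_i}}{e^{t\tau}}.$$

The moment generating function of a standard normal random variable is $\mathbb{E} e^{ty} = e^{t^2/2}$, so

$$\mathbb{E}_{x_i, y_i} e^{t x_i y_i} = \mathbb{E}_{x_i}\big[\mathbb{E}_{y_i} e^{t x_i y_i}\big] = \mathbb{E}_{x_i} e^{(t^2/2) x_i^2}.$$

When $x \sim \mathcal{N}(0, 1)$, the random variable $x^2$ has a $\chi^2$ distribution with 1 degree of freedom. The moment generating function of this variable is $\mathbb{E} e^{t x^2} = \sqrt{\frac{1}{1 - 2t}} = \sqrt{1 + \frac{2t}{1 - 2t}}$ for any $t < \frac{1}{2}$. Hence,

$$\mathbb{E}_{x_i} e^{(t^2/2) x_i^2} \le \sqrt{1 + \frac{t^2}{1 - t^2}} \le e^{\frac{t^2}{2(1 - t^2)}}$$

for any $t < 1$. Combining the above results and choosing $t$ appropriately (for instance $t = \tau/(2n)$, for sufficiently large $n$) yields

$$\Pr_{x,y}[\langle x, y \rangle > \tau] \le e^{\frac{n t^2}{2(1 - t^2)} - t\tau} \le \frac{\epsilon^3}{4n}.$$

The same argument shows that $\Pr[\langle x, y \rangle < -\tau] \le \frac{\epsilon^3}{4n}$ as well. □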

The reason we consider the truncation $g^*$ is that its smaller $\ell_\infty$ norm will enable us to apply a strong Bernstein-type inequality on the concentration of measure of the U-statistic estimate of $\mathbb{E} g^*$.

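Lemma D.6 (Arcones). For a symmetric function $h : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, let $\Sigma^2 = \mathbb{E}_x[\mathbb{E}_y[h(x, y)]^2] - \mathbb{E}_{x,y}[h(x, y)]^2$, let $b = \|h - \mathbb{E} h\|_\infty$, and let $U_m(h)$ be a random variable obtained by drawing $x^1, \ldots, x^m$ independently at random and setting $U_m(h) = \binom{m}{2}^{-1} \sum_{i<j} h(x^i, x^j)$. Then for every $t > 0$,

$$\Pr\big[|U_m(h) - \mathbb{E} h| > t\big] \le 4 \exp\left(-\frac{m t^2}{8\Sigma^2 + 100 b t}\right).$$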
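To make the U-statistic of the truncated kernel concrete, here is a small sketch that computes $U_m(g^*)$ from $m$ Gaussian sample points. It is an illustration under stated assumptions rather than the LTF-Tester itself: the function name `truncated_u_statistic`, the natural logarithm in the truncation threshold, the weight vector, and the values of $n$, $m$, and $\epsilon$ are all choices made for this example.

```python
import itertools
import math
import random

def truncated_u_statistic(points, labels, eps):
    """U-statistic of g*(x, y) = f(x) f(y) <x, y> * 1[|<x, y>| <= tau]."""
    n = len(points[0])
    # Truncation threshold from the definition of g*; natural log is an assumption.
    tau = math.sqrt(4 * n * math.log(4 * n / eps ** 3))
    total = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        inner = sum(a * b for a, b in zip(points[i], points[j]))
        if abs(inner) <= tau:
            total += labels[i] * labels[j] * inner
    return total / math.comb(len(points), 2)

rng = random.Random(2)                  # fixed seed (assumption) for reproducibility
n, m, eps = 4, 400, 0.5
w = [1.0, -2.0, 0.5, 3.0]               # hypothetical LTF weights
points = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
labels = [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for x in points]
nu = truncated_u_statistic(points, labels, eps)
print(f"U-statistic estimate = {nu:.3f}; 2/pi = {2 / math.pi:.3f}")
```

Note that all $\binom{m}{2}$ pairs reuse the same $m$ labels, which is why the number of label queries is $m$ rather than the number of pairs.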

We are now ready to complete the proof of the upper bound of Theorem 4.1.

Theorem D.7 (Upper bound in Theorem 4.1 restated). Linear threshold functions can be tested over the standard $n$-dimensional Gaussian distribution with $O(\sqrt{n \log n})$ queries in both the active and passive testing models.

Proof. Consider the LTF-Tester algorithm. When the estimates $\tilde{\mu}$ and $\tilde{\nu}$ satisfy

$$|\tilde{\mu} - \mathbb{E} f| \le \epsilon^3 \quad \text{and} \quad |\tilde{\nu} - \mathbb{E}[f(x)f(y)\langle x, y \rangle]| \le \epsilon^3,$$

Lemmas D.3 and D.4 guarantee that the algorithm correctly distinguishes LTFs from functions that are far from LTFs. To complete the proof, we must therefore show that the estimates are within the specified error bounds with probability at least 2/3.
The values $f(x^1), \ldots, f(x^m)$ are independent $\{-1, 1\}$-valued random variables. By Hoeffding's inequality,

$$\Pr[|\tilde{\mu} - \mathbb{E} f| \le \epsilon^3] \ge 1 - 2e^{-\epsilon^6 m/2} = 1 - 2e^{-O(\sqrt{n})}.$$

The estimate $\tilde{\nu}$ is a U-statistic with kernel $g^*$ as defined above. This kernel satisfies

$$\|g^* - \mathbb{E} g^*\|_\infty \le 2\|g^*\|_\infty = 2\sqrt{4n \log(4n/\epsilon^3)}$$

and

$$\Sigma^2 \le \mathbb{E}_y\big[\mathbb{E}_x[g^*(x, y)]^2\big] = \mathbb{E}_y\big[\mathbb{E}_x[f(x)f(y)\langle x, y \rangle \mathbf{1}[|\langle x, y \rangle| \le \tau]]^2\big].$$

For any two functions 0, : M" — M, when t/; is {0, l}-valued the Cauchy-Schwarz inequality implies that 

E,[,/.(x)^(x)]2 < E,[</.(x)]E,[0(x)^(x)2] = E,[,/.(x)]E,[</.(x)^(x)] and so E,.[0(x)V(x)]2 < E,[,^(x)]. Ap- 
plying this inequality to the expression for gives 

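$$\Sigma^2 \le \mathbb{E}_y\big[\mathbb{E}_x[f(x)f(y)\langle x, y \rangle]^2\big] = \mathbb{E}_y\Big[\Big(\sum_{i=1}^n f(y) y_i\, \mathbb{E}_x[f(x)x_i]\Big)^2\Big] = \sum_{i,j} \hat{f}(e_i)\hat{f}(e_j)\, \mathbb{E}_y[y_i y_j] = \sum_{i=1}^n \hat{f}(e_i)^2.$$

By Parseval's identity, we have $\sum_i \hat{f}(e_i)^2 \le \|\hat{f}\|_2^2 = \|f\|_2^2 = 1$. Lemmas D.5 and D.6 imply that, with $t = \frac{1}{2}\epsilon^3$,

$$\Pr[|\tilde{\nu} - \mathbb{E} g| \le \epsilon^3] \ge \Pr[|\tilde{\nu} - \mathbb{E} g^*| \le \tfrac{1}{2}\epsilon^3] \ge 1 - 4\exp\left(-\frac{m t^2}{8 + 200\sqrt{n \log(4n/\epsilon^3)}\, t}\right).$$

The union bound completes the proof of correctness. □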



E Proofs for Testing Disjoint Unions 

Theorem 5.1 (Restated). Given properties $\mathcal{P}_1, \ldots, \mathcal{P}_N$, if each $\mathcal{P}_i$ is testable over $D_i$ with $q(\epsilon)$ queries and $U(\epsilon)$ unlabeled samples, then their disjoint union $\mathcal{P}$ is testable over the combined distribution $D$ with $O(q(\epsilon/2) \cdot \log^3 \frac{1}{\epsilon})$ queries and $O(U(\epsilon/2) \cdot \frac{N}{\epsilon} \log^3 \frac{1}{\epsilon})$ unlabeled samples.

Proof. Let $p = (p_1, \ldots, p_N)$ denote the mixing weights for distribution $D$; that is, a random draw from $D$ can be viewed as selecting $i$ from distribution $p$ and then selecting $x$ from $D_i$. We are given that each $\mathcal{P}_i$ is testable with failure probability 1/3 using $q(\epsilon)$ queries and $U(\epsilon)$ unlabeled samples. By repetition, this implies that each is testable with failure probability $\delta$ using $q_\delta(\epsilon) = O(q(\epsilon) \log(1/\delta))$ queries and $U_\delta(\epsilon) = O(U(\epsilon) \log(1/\delta))$ unlabeled samples, where we will set $\delta = \epsilon^2$. We now test property $\mathcal{P}$ as follows:

For $\epsilon' = 1/2, 1/4, 1/8, \ldots, \epsilon/2$ do:
    Repeat $O(\frac{\epsilon'}{\epsilon} \log(1/\epsilon))$ times:

    1. Choose a random $(i, x)$ from $D$.

    2. Sample until either $U_\delta(\epsilon')$ samples have been drawn from $D_i$ or $(8N/\epsilon) U_\delta(\epsilon')$ samples total have been drawn from $D$, whichever comes first.

    3. In the former case, run the tester for property $\mathcal{P}_i$ with parameter $\epsilon'$, making $q_\delta(\epsilon')$ queries. If the tester rejects, then reject.

If all runs have accepted, then accept.






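First, to analyze the total number of queries and samples: since we can assume $q(\epsilon) \ge 1/\epsilon$ and $U(\epsilon) \ge 1/\epsilon$, we have $q_\delta(\epsilon')\epsilon'/\epsilon = O(q_\delta(\epsilon/2))$ and $U_\delta(\epsilon')\epsilon'/\epsilon = O(U_\delta(\epsilon/2))$ for $\epsilon' \ge \epsilon/2$. Thus, the total number of queries made is at most

$$\sum_{\epsilon'} q_\delta(\epsilon/2) \log(1/\epsilon) = O\left(q(\epsilon/2) \cdot \log^3 \frac{1}{\epsilon}\right)$$

and the total number of unlabeled samples is at most

$$\frac{8N}{\epsilon} \sum_{\epsilon'} U_\delta(\epsilon/2) \log(1/\epsilon) = O\left(U(\epsilon/2) \frac{N}{\epsilon} \log^3 \frac{1}{\epsilon}\right).$$

Next, to analyze correctness, if indeed $f \in \mathcal{P}$ then each call to a tester rejects with probability at most $\delta$ so the overall failure probability is at most $(\delta/\epsilon) \log^2(1/\epsilon) < 1/3$; thus it suffices to analyze the case that $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$.

If $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$ then $\sum_{i : p_i \ge \epsilon/(4N)} p_i \cdot \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \ge 3\epsilon/4$. Moreover, for indices $i$ such that $p_i \ge \epsilon/(4N)$, with high probability Step 2 draws $U_\delta(\epsilon')$ samples, so we may assume for such indices the tester for $\mathcal{P}_i$ is indeed run in Step 3. Let $I = \{i : p_i \ge \epsilon/(4N) \text{ and } \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \ge \epsilon/2\}$. Thus, we have

$$\sum_{i \in I} p_i \cdot \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \ge \epsilon/4.$$

Let $I_{\epsilon'} = \{i \in I : \mathrm{dist}_{D_i}(f_i, \mathcal{P}_i) \in [\epsilon', 2\epsilon']\}$. Bucketing the above summation by values $\epsilon'$ in this way implies that for some value $\epsilon' \in \{\epsilon/2, \epsilon, 2\epsilon, \ldots, 1/2\}$, we have:

$$\sum_{i \in I_{\epsilon'}} p_i \ge \epsilon/(8\epsilon' \log(1/\epsilon)).$$

This in turn implies that with probability at least 2/3, the run of the algorithm for this value of $\epsilon'$ will find such an $i$ and reject, as desired. □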



F Proofs for Testing Dimensions 

F.1 Passive Testing Dimension (proof of Theorem 6.2)

Lower bound: By design, $d_{passive}$ is a lower bound on the number of examples needed for passive testing. In particular, if $d_S(\pi, \pi') \le 1/4$, and if the target is with probability 1/2 chosen from $\pi$ and with probability 1/2 chosen from $\pi'$, even the Bayes optimal tester will fail to identify the correct distribution with probability $\frac{1}{2} \sum_{y \in \{0,1\}^{|S|}} \min(\pi_S(y), \pi'_S(y)) = \frac{1}{2}(1 - d_S(\pi, \pi')) \ge 3/8$. The definition of $d_{passive}$ implies that there exist $\pi \in \Pi_0$, $\pi' \in \Pi_\epsilon$ such that $\Pr_S(d_S(\pi, \pi') \le 1/4) \ge 3/4$. Since $\pi'$ has a $1 - o(1)$ probability mass on functions that are $\epsilon$-far from $\mathcal{P}$, this implies that over random draws of $S$ and $f$, the overall failure probability of any tester is at least $(1 - o(1))(3/8)(3/4) > 1/4$. Thus, at least $d_{passive} + 1$ random labeled examples are required if we wish to guarantee error at most 1/4. This in turn implies $\Omega(d_{passive})$ examples are needed to guarantee error at most 1/3.

Upper bound: We now argue that $O(d_{passive})$ examples are sufficient for testing as well. Toward this end, consider the following natural testing game. The adversary chooses a function $f$ such that either $f \in \mathcal{P}$ or $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$. The tester picks a function $A$ that maps labeled samples of size $k$ to accept/reject. That is, $A$ is a deterministic passive testing algorithm. The payoff to the tester is the probability that $A$ is correct when $S$ is chosen iid from $D$ and labeled by $f$.

If $k > d_{passive}$ then (by definition of $d_{passive}$) we know that for any distribution $\pi$ over $f \in \mathcal{P}$ and any distribution $\pi'$ over $f$ that are $\epsilon$-far from $\mathcal{P}$, we have $\Pr_{S \sim D^k}(d_S(\pi, \pi') > 1/4) > 1/4$. We now need to translate this into a statement about the value of the game. Note that any mixed strategy of the adversary can be viewed as $\alpha\pi + (1 - \alpha)\pi'$ for some distribution $\pi$ over $f \in \mathcal{P}$, some distribution $\pi'$ over $f$ that are $\epsilon$-far from $\mathcal{P}$, and some $\alpha \ge 0$. The key fact we can use is that against such a mixed strategy, the Bayes optimal predictor has error exactly

$$\sum_y \min(\alpha\pi_S(y), (1 - \alpha)\pi'_S(y)) \le \max(\alpha, 1 - \alpha) \sum_y \min(\pi_S(y), \pi'_S(y)),$$

while

$$\sum_y \min(\pi_S(y), \pi'_S(y)) = 1 - (1/2) \sum_y |\pi_S(y) - \pi'_S(y)| = 1 - d_S(\pi, \pi'),$$

so that the Bayes risk is at most $\max(\alpha, 1 - \alpha)(1 - d_S(\pi, \pi'))$. Thus, for any $\alpha \in [7/16, 9/16]$, if $d_S(\pi, \pi') > 1/4$, the Bayes risk is less than $(9/16)(3/4) = 27/64$. Furthermore, any $\alpha \notin [7/16, 9/16]$ has Bayes risk at most 7/16. Thus, since $d_S(\pi, \pi') > 1/4$ with probability $> 1/4$ (and if $d_S(\pi, \pi') \le 1/4$ then the error probability of the Bayes optimal predictor is at most 1/2), for any mixed strategy of the adversary, the Bayes optimal predictor has risk less than $(1/4)(7/16) + (3/4)(1/2) = 31/64$.
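The constants in the preceding paragraph can be verified with exact rational arithmetic. The short check below is only an illustration; the variable names are ours and not part of the original argument.

```python
from fractions import Fraction as F

# Case alpha in [7/16, 9/16] and d_S > 1/4: risk < max(alpha, 1 - alpha) * (1 - 1/4).
risk_near_balanced = F(9, 16) * (1 - F(1, 4))
# Case alpha outside [7/16, 9/16]: risk at most min(alpha, 1 - alpha) <= 7/16.
risk_unbalanced = F(7, 16)
# Worst case over alpha when the sample is "good" (d_S > 1/4).
risk_good_sample = max(risk_near_balanced, risk_unbalanced)
# Good samples occur with probability > 1/4; otherwise the risk is at most 1/2.
overall_risk = F(1, 4) * risk_good_sample + F(3, 4) * F(1, 2)
advantage = F(1, 2) - overall_risk

print(risk_near_balanced, risk_good_sample, overall_risk, advantage)
```

The resulting advantage over random guessing is the constant $1/64$ used when the minimax theorem is applied next.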

Now, applying the minimax theorem we get that for $k = d_{passive} + 1$, there exists a mixed strategy $A$ for the tester such that for any function chosen by the adversary, the probability the tester is correct is at least $1/2 + \gamma$ for a constant $\gamma > 0$ (namely, 1/64). We can now boost the correctness probability using a constant-factor larger sample. Specifically, let $m = c \cdot (d_{passive} + 1)$ for some constant $c$, and consider a sample $S$ of size $m$. The tester simply partitions the sample $S$ into $c$ pieces, runs $A$ separately on each piece, and then takes majority vote. This gives us that $O(d_{passive})$ examples are sufficient for testing with any desired constant success probability in $(1/2, 1)$.
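The boosting step can be sketched as follows. This is a toy illustration, not the tester from the proof: `majority_vote_tester` and `noisy_tester` are hypothetical names, and the base tester is simulated as a coin that gives the right answer ("accept") with probability 0.6, a much larger advantage than the $1/64$ above, chosen so that a small number of pieces suffices.

```python
import random

def majority_vote_tester(base_tester, sample, pieces):
    """Split `sample` into `pieces` equal parts, run `base_tester` on each, take the majority."""
    size = len(sample) // pieces
    accepts = sum(
        1 for j in range(pieces) if base_tester(sample[j * size:(j + 1) * size])
    )
    return 2 * accepts > pieces

rng = random.Random(3)   # fixed seed (assumption) for reproducibility

def noisy_tester(piece):
    # Stand-in for the weak tester A: the correct answer is "accept" (True),
    # and the tester is right with probability 0.6, independently per piece.
    return rng.random() < 0.6

sample = list(range(101 * 5))     # placeholder sample; contents are unused by the stand-in
trials = 200
successes = sum(
    majority_vote_tester(noisy_tester, sample, pieces=101) for _ in range(trials)
)
print(f"boosted success rate over {trials} trials: {successes / trials:.2f}")
```

Using an odd number of pieces avoids ties in the vote; with a smaller advantage $\gamma$, the number of pieces needed grows on the order of $1/\gamma^2$, which is still a constant.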

F.2 Coarse Active Testing Dimension (proof of Theorem 6.4)

Lower bound: First, we claim that any nonadaptive active testing algorithm that uses $\le d_{coarse}/c$ label requests must use more than $n^c$ unlabeled examples (and thus no algorithm can succeed using $o(d_{coarse})$ labels). To see this, suppose algorithm $A$ draws $n^c$ unlabeled examples. The number of subsets of size $d_{coarse}/c$ is at most $n^{d_{coarse}}/6$ (for $d_{coarse}/c \ge 3$). So, by definition of $d_{coarse}$ and the union bound, with probability at least 5/6, all such subsets $S$ satisfy the property that $d_S(\pi, \pi') < 1/4$. Therefore, for any sequence of such label requests, the labels observed will not be sufficient to reliably distinguish $\pi$ from $\pi'$. Adaptive active testers can potentially choose their next point to query based on labels observed so far, but the above immediately implies that even adaptive active testers cannot succeed using $o(\log(d_{coarse}))$ queries.

Upper bound: For the upper bound, we modify the argument from the passive testing dimension analysis as follows. We are given that for any distribution $\pi$ over $f \in \mathcal{P}$ and any distribution $\pi'$ over $f$ that are $\epsilon$-far from $\mathcal{P}$, for $k = d_{coarse} + 1$, we have $\Pr_{S \sim D^k}(d_S(\pi, \pi') > 1/4) > n^{-k}$. Thus, we can sample $U \sim D^m$ with $m = \Theta(k \cdot n^k)$, and partition $U$ into subsamples $S_1, S_2, \ldots, S_{cn^k}$ of size $k$ each. With high probability, at least one of these subsamples $S_i$ will have $d_{S_i}(\pi, \pi') > 1/4$. We can thus simply examine each subsample, identify one such that $d_{S_i}(\pi, \pi') > 1/4$, and query the points in that sample. As in the proof for the passive bound, this implies that for any strategy for the adversary in the associated testing game, the best response has probability at least $1/2 + \gamma$ of success for some constant $\gamma > 0$. By the minimax theorem, this implies a testing strategy with success probability $1/2 + \gamma$ which can then be boosted to 2/3. The total number of label requests used in the process is only $O(d_{coarse})$.

Note, however, that this strategy uses a number of unlabeled examples $\Omega(n^{d_{coarse}+1})$. Thus, this only implies an active tester for $d_{coarse} = O(1)$. Nonetheless, combining the upper and lower bounds yields Theorem 6.4.

F.3 Active Testing Dimension (proof of Theorem 6.6)

Lower bound: For a given sample $U$, we can think of an adaptive active tester as a decision tree, defined based on which example it would request the label of next given that the previous requests have been answered in any given way. A tester making $k$ queries would yield a decision tree of depth $k$. By definition of $d_{active}(u)$, with probability at least 3/4 (over choice of $U$), any such tester has error probability at least $(1/4)(1 - o(1))$ over the choice of $f$. Thus, the overall failure probability is at least $(3/4)(1/4)(1 - o(1)) > 1/8$.

Upper bound: We again consider the natural testing game. We are given that for any mixed strategy of the adversary with equal probability mass on functions in $\mathcal{P}$ and functions $\epsilon$-far from $\mathcal{P}$, the best response of the tester has expected payoff at least $(1/4)(3/4) + (3/4)(1/2) = 9/16$. This in turn implies that for any mixed strategy at all, the best response of the tester has expected payoff at least 33/64 (if the adversary puts more than 17/32 probability mass on either type of function, the tester can just guess that type with expected payoff at least 17/32, else it gets payoff at least $(1 - 1/16)(9/16) > 33/64$). By the minimax theorem, this implies existence of a randomized strategy for the tester with at least this payoff. We then boost correctness using $c \cdot u$ samples and $c \cdot d_{active}(u)$ queries, running the tester $c$ times on disjoint samples and taking majority vote.
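The payoff constants in the upper bound can again be checked exactly. The variable names below are ours; the factor $1 - 1/16$ is read here as the weight of the balanced part of a mixture in which neither type of function has more than $17/32$ of the mass (so each has at least $15/32$), which is one way to interpret the parenthetical remark.

```python
from fractions import Fraction as F

# Payoff against a balanced mixture: a good sample w.p. 1/4 gives payoff 3/4,
# and otherwise the tester still achieves payoff 1/2.
balanced_payoff = F(1, 4) * F(3, 4) + F(3, 4) * F(1, 2)
# If one type has more than 17/32 of the mass, guessing that type pays at least 17/32.
guess_payoff = F(17, 32)
# Otherwise each type has mass at least 15/32, so a balanced mixture of total
# weight 2 * 15/32 = 15/16 = 1 - 1/16 is embedded in the adversary's strategy.
near_balanced_payoff = (1 - F(1, 16)) * balanced_payoff
target = F(33, 64)

print(balanced_payoff, guess_payoff, near_balanced_payoff, target)
```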

F.4 Lower Bounds for Testing LTFs (proof of Theorem 6.8)

We complete the proofs for the lower bounds on the query complexity for testing linear threshold functions in the active and passive models. This proof has three parts. First, in Section F.4.1, we introduce some preliminary (technical) results that will be used to prove the lower bounds on the passive and coarse dimensions of testing LTFs. In Section F.4.2, we introduce some more preliminary results regarding random matrices that we will use to bound the active dimension of the class. Finally, in Section F.4.3, we put it all together and complete the proof of Theorem 6.8.

The high level idea of our proof is that for a random LTF given by weight vector $w \sim \mathcal{N}(0, I_{n \times n})$, even if we are given the exact value $w \cdot x$ for each example $x$ (rather than just $\mathrm{sgn}(w \cdot x)$), we still could not distinguish these values from random Gaussian noise. Towards this end, for two distributions $P, Q$ over $\mathbb{R}^k$, we use $\|P - Q\|$ to denote the total variation distance between them. For instance, given two distributions $\pi, \pi'$ over boolean functions, and given a sample $S$, we have $d_S(\pi, \pi') = \|\pi_S - \pi'_S\|$.

F.4.1 Preliminaries for $d_{passive}$ and $d_{coarse}$

Fix any $K$. Let the dataset $X = \{x_1, x_2, \ldots, x_K\}$ be sampled iid according to a $\mathcal{N}(0, I_{n \times n})$ distribution.¹ Let $X \in \mathbb{R}^{K \times n}$ be the corresponding data matrix.
Suppose $w \sim \mathcal{N}(0, I_{n \times n})$. We let

$$z = Xw,$$

and note that the conditional distribution of $z$ given $X$ is normal with mean 0 and ($X$-dependent) covariance matrix, which we denote by $\Sigma$. Further applying a threshold function to $z$ gives $y$ as the predicted label vector of an LTF.

Lemma F.1. For any square non-singular matrix $B$, $\log(\det(B)) = \mathrm{Tr}(\log(B))$, where $\log(B)$ is the matrix logarithm of $B$.

Proof. From [38], we know that, since every eigenvalue of $A$ corresponds to an eigenvalue of $\exp(A)$,

$$\det(\exp(A)) = \exp(\mathrm{Tr}(A)) \qquad (2)$$

where $\exp(A)$ is the matrix exponential of $A$. Taking the logarithm of both sides of (2), we get

$$\log(\det(\exp(A))) = \mathrm{Tr}(A) \qquad (3)$$

Let $B = \exp(A)$ (thus $A = \log(B)$). Then (3) can be rewritten as $\log(\det(B)) = \mathrm{Tr}(\log B)$. □
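A quick numerical check of Lemma F.1 is possible for a matrix $B = I + A$ with small $A$, using the standard power series $\log(I + A) = \sum_{k \ge 1} (-1)^{k+1} A^k / k$, which converges when $\|A\| < 1$. The helpers `matmul`, `trace`, and `trace_log_series` and the particular $2 \times 2$ matrix are illustrative assumptions of this sketch.

```python
import math

def matmul(P, Q):
    """Product of two square matrices given as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def trace(P):
    return sum(P[i][i] for i in range(len(P)))

def trace_log_series(A, terms=60):
    """Tr(log(I + A)) via sum_{k>=1} (-1)^(k+1) Tr(A^k) / k; requires ||A|| < 1."""
    power = [row[:] for row in A]       # holds A^k, starting with k = 1
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k + 1) * trace(power) / k
        power = matmul(power, A)
    return total

A = [[0.10, 0.05], [0.05, -0.08]]       # small symmetric perturbation (illustrative)
B = [[1 + A[0][0], A[0][1]], [A[1][0], 1 + A[1][1]]]
det_B = B[0][0] * B[1][1] - B[0][1] * B[1][0]
lhs = math.log(det_B)                   # log(det(B))
rhs = trace_log_series(A)               # Tr(log(B))
print(f"log det(B) = {lhs:.12f}, Tr log(B) = {rhs:.12f}")
```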

Fixing $X$, let $P_{(z/\sqrt{n})|X}$ denote the conditional distribution over $z/\sqrt{n}$ given by choosing $w \sim \mathcal{N}(0, I_{n \times n})$ and letting $z = Xw$.

¹ In fact, essentially the same argument would work for many other product distributions, including uniform on $\{-1, +1\}^n$.

Lemma F.2. For sufficiently large $n$, and a value $K = \Omega(\sqrt{n/\log(K/\delta)})$, with probability at least $1 - \delta$ (over $X$),

$$\|P_{(z/\sqrt{n})|X} - \mathcal{N}(0, I)\| \le 1/4.$$
Proof. For sufficiently large $n$, for any pair $x_i$ and $x_j$, by Bernstein's inequality, with probability $1 - \delta'$,

$$x_i^T x_j \in \left[-2\sqrt{n \log\tfrac{2}{\delta'}},\; 2\sqrt{n \log\tfrac{2}{\delta'}}\right]$$

for $i \ne j$, while concentration inequalities for $\chi^2$ random variables [46] imply that with probability $1 - \delta'$,

$$x_i^T x_j \in \left[n - 2\sqrt{n \log\tfrac{2}{\delta'}},\; n + 2\sqrt{n \log\tfrac{2}{\delta'}} + 2\log\tfrac{2}{\delta'}\right]$$

for $i = j$. By the union bound, setting $\delta' = \delta/K^2$, the above inclusions hold simultaneously for all $i, j$ with probability at least $1 - \delta$. For the remainder of the proof we suppose this (probability $1 - \delta$) event occurs.

Let $\beta = 2\sqrt{\frac{\log(2K^2/\delta)}{n}} + \frac{2\log(2K^2/\delta)}{n}$. Thus $\Sigma$ is a $K \times K$ matrix, with $\Sigma_{ii} \in [1 - \beta, 1 + \beta]$ for $i = 1, \ldots, K$ and $\Sigma_{ij} \in [-\beta, \beta]$ for all $i \ne j$.

Let $P_1 = \mathcal{N}(0, \Sigma^{K \times K})$ and $P_2 = \mathcal{N}(0, I^{K \times K})$. The density of $P_1$ is

$$p_1(z) = \frac{1}{\sqrt{(2\pi)^K \det(\Sigma)}} \exp\left(-\frac{1}{2} z^T \Sigma^{-1} z\right)$$

and the density of $P_2$ is

$$p_2(z) = \frac{1}{\sqrt{(2\pi)^K}} \exp\left(-\frac{1}{2} z^T z\right).$$

Then the $L_1$ distance between the two distributions $P_1$ and $P_2$ satisfies

$$\int |dP_2 - dP_1| \le 2\sqrt{K(P_1, P_2)} = 2\sqrt{(1/2) \log\det(\Sigma)},$$

where this last equality is by ES. By Lemma F.1, $\log(\det(\Sigma)) = \mathrm{Tr}(\log(\Sigma))$. Write $A = \Sigma - I$. By the Taylor series

$$\log(I + A) = -\sum_{i=1}^{\infty} \frac{1}{i}\big(I - (I + A)\big)^i = -\sum_{i=1}^{\infty} \frac{1}{i}(-A)^i.$$

Thus,

$$\mathrm{Tr}(\log(I + A)) = \sum_{i=1}^{\infty} -\frac{1}{i} \mathrm{Tr}\big((-A)^i\big). \qquad (4)$$

Every entry in $A^i$ can be expressed as a sum of at most $K^{i-1}$ terms, each of which can be expressed as a product of exactly $i$ entries from $A$. Thus, every entry in $A^i$ is in the range $[-K^{i-1}\beta^i, K^{i-1}\beta^i]$. This means $\mathrm{Tr}(A^i) \le K^i \beta^i$.

Therefore, if $K\beta < 1/2$, since $\mathrm{Tr}(A) = 0$, the expansion of $\mathrm{Tr}(\log(I + A))$ is at most $\sum_{i=2}^{\infty} K^i \beta^i = O(K^2\beta^2)$.

In particular, for some $K = \Omega(\sqrt{n/\log(K/\delta)})$, $\mathrm{Tr}(\log(I + A))$ is bounded by the appropriate constant to obtain the stated result. □



F.4.2 Preliminaries for $d_{active}$

Given an $n \times m$ matrix $A$ with real entries $\{a_{i,j}\}_{i \in [n], j \in [m]}$, the adjoint (or transpose - the two are equivalent since $A$ contains only real values) of $A$ is the $m \times n$ matrix $A^*$ whose $(i, j)$-th entry equals $a_{j,i}$. Let us write $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$ to denote the eigenvalues of $\sqrt{A^*A}$. These values are the singular values of $A$. The matrix $A^*A$ is positive semidefinite, so the singular values of $A$ are all non-negative. We write $\lambda_{\max}(A) = \lambda_1$ and $\lambda_{\min}(A) = \lambda_m$ to represent its largest and smallest singular values. Finally, the induced norm (or operator norm) of $A$ is

$$\|A\| = \max_{x \in \mathbb{R}^m \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \max_{x \in \mathbb{R}^m : \|x\|_2 = 1} \|Ax\|_2.$$

For more details on these definitions, see any standard linear algebra text (e.g., [59]). We will also use the following strong concentration bounds on the singular values of random matrices.



Lemma F.3 (See [64, Cor. 5.35]). Let $A$ be an $n \times m$ matrix whose entries are independent standard normal random variables. Then for any $t > 0$, the singular values of $A$ satisfy

$$\sqrt{n} - \sqrt{m} - t \le \lambda_{\min}(A) \le \lambda_{\max}(A) \le \sqrt{n} + \sqrt{m} + t \qquad (5)$$

with probability at least $1 - 2e^{-t^2/2}$.

The proof of this lemma follows from Talagrand's inequality and Gordon's Theorem for Gaussian matrices. See [64] for the details. The lemma implies the following corollary which we will use in the proof of our theorem.

Corollary F.4. Let $A$ be an $n \times m$ matrix whose entries are independent standard normal random variables. For any $0 < t < \sqrt{n} - \sqrt{m}$, the $m \times m$ matrix $\frac{1}{n} A^*A$ satisfies both inequalities

UA*A - III < and det i^A*A) >e ^ ^ ^ ) (6) 

lift II / fy^ \ ih / 

with probability at least $1 - 2e^{-t^2/2}$.



n 
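Proof. When there exists $0 < z < 1$ such that $1 - z \le \frac{1}{\sqrt{n}}\lambda_{\max}(A) \le 1 + z$, the identity $\frac{1}{\sqrt{n}}\lambda_{\max}(A) = \|\frac{1}{\sqrt{n}}A\| = \max_{\|x\|_2 = 1} \|\frac{1}{\sqrt{n}}Ax\|_2$ implies that

$$1 - 2z \le (1 - z)^2 \le \max_{\|x\|_2 = 1} \left\|\tfrac{1}{\sqrt{n}}Ax\right\|_2^2 \le (1 + z)^2 \le 1 + 3z.$$

These inequalities and the identity $\|\frac{1}{n}A^*A - I\| = \max_{\|x\|_2 = 1} \|\frac{1}{\sqrt{n}}Ax\|_2^2 - 1$ imply that $-2z \le \|\frac{1}{n}A^*A - I\| \le 3z$. Fixing $z = \frac{\sqrt{m} + t}{\sqrt{n}}$ and applying Lemma F.3 completes the proof of the first inequality.

Recall that $\lambda_1 \le \cdots \le \lambda_m$ are the eigenvalues of $\sqrt{A^*A}$. Then

$$\det\left(\tfrac{1}{n}A^*A\right) = \frac{\det(\sqrt{A^*A})^2}{n^m} = \frac{(\lambda_1 \cdots \lambda_m)^2}{n^m} \ge \left(\frac{\lambda_1^2}{n}\right)^m = \left(\frac{\lambda_{\min}(A)^2}{n}\right)^m.$$

Lemma F.3 and the elementary inequality $1 + x \le e^x$ complete the proof of the second inequality. □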
F.4.3 Proof of Theorem 6.8

Theorem 6.8 (Restated). For linear threshold functions under the standard Gaussian distribution in $\mathbb{R}^n$, $d_{passive} = \Omega(\sqrt{n/\log(n)})$ and $d_{active} = \Omega((n/\log(n))^{1/3})$.

Proof. Let $K$ be as in Lemma F.2 for $\delta = 1/4$. Let $D = \{(x_1, y_1), \ldots, (x_K, y_K)\}$ denote the sequence of labeled data points under the random LTF based on $w$. Furthermore, let $D' = \{(x_1, y'_1), \ldots, (x_K, y'_K)\}$ denote the sequence of labeled data points under a target function that assigns an independent random label to each data point. Also let $z_i = (1/\sqrt{n}) w^T x_i$, and let $z' \sim \mathcal{N}(0, I_{K \times K})$. Let $E = \{(x_1, z_1), \ldots, (x_K, z_K)\}$ and $E' = \{(x_1, z'_1), \ldots, (x_K, z'_K)\}$. Note that we can think of $y_i$ and $y'_i$ as being functions of $z_i$ and $z'_i$, respectively. Thus, letting $X = \{x_1, \ldots, x_K\}$, by Lemma F.2, with probability at least 3/4,

$$\|P_{D|X} - P_{D'|X}\| \le \|P_{E|X} - P_{E'|X}\| \le 1/4.$$

This suffices for the claim that $d_{passive} = \Omega(K) = \Omega(\sqrt{n/\log(n)})$.

Next we turn to the lower bound on $d_{active}$. Let us now introduce two distributions $P_{yes}$ and $P_{no}$ over linear threshold functions and functions that (with high probability) are far from linear threshold functions, respectively. We draw a function $f$ from $P_{yes}$ by first drawing a vector $w \sim \mathcal{N}(0, I_{n \times n})$ from the $n$-dimensional standard normal distribution. We then define $f : x \mapsto \mathrm{sgn}(\frac{1}{\sqrt{n}} x \cdot w)$. To draw a function $g$ from $P_{no}$, we define $g(x) = \mathrm{sgn}(y_x)$ where each $y_x$ variable is drawn independently from the standard normal distribution $\mathcal{N}(0, 1)$.

Let $X \in \mathbb{R}^{n \times q}$ be a random matrix obtained by drawing $q$ vectors from the $n$-dimensional normal distribution $\mathcal{N}(0, I_{n \times n})$ and setting these vectors to be the columns of $X$. Equivalently, $X$ is the random matrix whose entries are independent standard normal variables. When we view $X$ as a set of $q$ queries to a function $f \sim P_{yes}$ or a function $g \sim P_{no}$, we get $f(X) = \mathrm{sgn}(\frac{1}{\sqrt{n}} Xw)$ and $g(X) = \mathrm{sgn}(y_X)$. Note that $\frac{1}{\sqrt{n}} Xw \sim \mathcal{N}(0, \frac{1}{n} X^*X)$ and $y_X \sim \mathcal{N}(0, I_{q \times q})$. To apply Lemma B.1 it suffices to show that the ratio of the pdfs for both these random variables is bounded by $\frac{6}{5}$ for all but $\frac{1}{5} 2^{-q}$ of the probability mass.

The pdf $p : \mathbb{R}^q \to \mathbb{R}$ of a $q$-dimensional random vector from the distribution $\mathcal{N}_{q \times q}(0, \Sigma)$ is

$$p(x) = (2\pi)^{-\frac{q}{2}} \det(\Sigma)^{-\frac{1}{2}} e^{-\frac{1}{2} x^T \Sigma^{-1} x}.$$

Therefore, the ratio function $r : \mathbb{R}^q \to \mathbb{R}$ between the pdfs of $\frac{1}{\sqrt{n}} Xw$ and of $y_X$ is

$$r(x) = \det\left(\tfrac{1}{n} X^*X\right)^{-\frac{1}{2}} e^{\frac{1}{2} x^T \left((\frac{1}{n} X^*X)^{-1} - I\right) x}.$$
Note that

$$x^T\left(\left(\tfrac{1}{n} X^*X\right)^{-1} - I\right) x \le \left\|\left(\tfrac{1}{n} X^*X\right)^{-1} - I\right\| \|x\|_2^2 = \left\|\tfrac{1}{n} X^*X - I\right\| \|x\|_2^2,$$

so by Lemma F.3 with probability at least $1 - 2e^{-t^2/2}$ we have



r{X) < e \ V y V 

By a union bound, for $U \sim \mathcal{N}(0, I_{n \times n})^u$, $u \in \mathbb{N}$ with $u \ge q$, the above inequality for $r(x)$ is true for all subsets of $U$ of size $q$, with probability at least $1 - u^q 2e^{-t^2/2}$. Fix $q = n^{\frac{1}{3}}/(50(\ln(n))^{\frac{1}{3}})$ and $t = 2\sqrt{q \ln(u)}$. Then $u^q 2e^{-t^2/2} \le 2u^{-q}$, which is $< 1/4$ for any sufficiently large $n$. When $\|x\|_2^2 \le 3q$ then for large $n$, $r(x) \le e^{74/625} < \frac{6}{5}$. To complete the proof, it suffices to show that when $x \sim \mathcal{N}(0, I_{q \times q})$, the probability that $\|x\|_2^2 > 3q$ is at most $\frac{1}{5} 2^{-q}$. The random variable $\|x\|_2^2$ has a $\chi^2$ distribution with $q$ degrees of freedom and expected value $\mathbb{E}\|x\|_2^2 = \sum_{i=1}^q \mathbb{E} x_i^2 = q$. Standard concentration bounds for $\chi^2$ variables imply that

$$\Pr\big[\|x\|_2^2 > 3q\big] \le \tfrac{1}{5} 2^{-q},$$

as we wanted to show. Thus, Lemma B.1 implies $\mathrm{err}^*(\mathrm{DT}_q, \mathrm{Fair}(\pi, \pi', U)) > 1/4$ holds whenever this $r(x)$ inequality is satisfied for all subsets of $U$ of size $q$; we have shown this happens with probability greater than 3/4, so we must have $d_{active} \ge q$. □

If we are only interested in bounding $d_{coarse}$, the proof can be somewhat simplified. Specifically, taking $\delta = n^{-K}$ in Lemma F.2 implies that with probability at least $1 - n^{-K}$,

$$\|P_{D|X} - P_{D'|X}\| \le \|P_{E|X} - P_{E'|X}\| \le 1/4,$$

which suffices for the claim that $d_{coarse} = \Omega(K)$, where $K = \Omega(\sqrt{n/(K \log(n))})$: in particular, $d_{coarse} = \Omega((n/\log(n))^{1/3})$.



G Testing Semi-Supervised Learning Assumptions 

We now consider testing of common assumptions made in semi-supervised learning [15], where unlabeled data, together with assumptions about how the target function and data distribution relate, are used to constrain the search space. As mentioned in Section 5, one such assumption we can test using our generic disjoint-unions tester is the cluster assumption, that if data lies in $N$ identifiable clusters, then points in the same cluster should have the same label. We can in fact achieve the following tighter bounds:

Theorem G.1. We can test the cluster assumption with active testing using $O(N/\epsilon)$ unlabeled examples and $O(1/\epsilon)$ queries.

Proof. Let $p_{i1}$ and $p_{i0}$ denote the probability mass on positive examples and negative examples respectively in cluster $i$, so $p_{i1} + p_{i0}$ is the total probability mass of cluster $i$. Then $\mathrm{dist}(f, \mathcal{P}) = \sum_i \min(p_{i1}, p_{i0})$. Thus, a simple tester is to draw a random example $x$, draw a random example $y$ from $x$'s cluster, and check if $f(x) = f(y)$. Notice that with probability exactly $\mathrm{dist}(f, \mathcal{P})$, point $x$ is in the minority class of its own cluster, and conditioned on this event, with probability at least 1/2, point $y$ will have a different label. It thus suffices to repeat this process $O(1/\epsilon)$ times. One complication is that as stated, this process might require a large unlabeled sample, especially if $x$ belongs to a cluster $i$ such that $p_{i0} + p_{i1}$ is small, so that many draws are needed to find a point $y$ in $x$'s cluster. To achieve the given unlabeled sample bound, we initially draw an unlabeled sample of size $O(N/\epsilon)$ and simply perform the above test on the uniform distribution $U$ over that sample, with distance parameter $\epsilon/2$. Standard sample complexity bounds [62] imply that $O(N/\epsilon)$ unlabeled points are sufficient so that if $\mathrm{dist}_D(f, \mathcal{P}) \ge \epsilon$ then with high probability, $\mathrm{dist}_U(f, \mathcal{P}) \ge \epsilon/2$. □

We now consider the property of a function having a large margin with respect to the underlying distribution: 
that is, the distribution D and target f are such that any point in the support of D|_{f=1} is at distance γ or more 
from any point in the support of D|_{f=0}. This is a common property assumed in graph-based and nearest-neighbor- 
style semi-supervised learning algorithms [15]. Note that we are not additionally requiring the target to be a linear 
separator or have any special functional form. For scaling, we assume that points lie in the unit ball in R^d, where we 
view d as constant and 1/γ as our asymptotic parameter. Since we are not assuming any specific functional form for 
the target, the number of labeled examples needed for learning could be as large as Ω(1/γ^d) by having a distribution 
with support over Ω(1/γ^d) points that are all at distance γ from each other (and therefore can be labeled arbitrarily). 
Furthermore, passive testing would require Ω(1/γ^{d/2}) samples, as this specific case encodes the cluster-assumption 
setting with N = Ω(1/γ^d) clusters. We will be able to perform active testing using only O(1/ε) label requests. 

First, one distinction between this and other properties we have been discussing is that it is a property of the 
relation between the target function f and the distribution D; i.e., of the combined distribution D_f = (D, f) over 
labeled examples. As a result, the natural notion of distance to this property is in terms of the variation distance 
of D_f to the closest distribution over labeled examples satisfying the property. As a simple example illustrating the 
issue, consider X = [0, 1], a target f that is negative on [0, 1/2) and positive on [1/2, 1], and a distribution D that is 
uniform but where the region [1/2, 1/2 + γ] is downweighted to have total probability mass only 1/2^n. Such a D_f is 
1/2^n-close to the property under variation distance, but would be nearly 1/2-far from the property if the only 
operation allowed were to change the function f. A second issue is that we also have to allow some slack on the γ 
parameter. Specifically, our tester will distinguish the case that D_f indeed has margin γ from the case that D_f is 
ε-far from having margin γ', where γ' = γ(1 − 1/c) for some constant c > 1; e.g., think of γ' = γ/2. This slack can also be 
seen to be necessary (see the discussion following the proof of Theorem 5.2). In particular, we have the following. 

Theorem 5.2 (Restated). For any γ, γ' = γ(1 − 1/c) for constant c > 1, for data in the unit ball in R^d for constant 
d, we can distinguish the case that D_f has margin γ from the case that D_f is ε-far from margin γ' using Active 
Testing with O(1/(γ^{2d} ε^2)) unlabeled examples and O(1/ε) label requests. 

Proof. First, partition the input space X (the unit ball in R^d) into regions R_1, R_2, ..., R_N of diameter at most 
γ/(2c). By a standard volume argument, this can be done using N = O(1/γ^d) regions (absorbing "c" into the O()). 
Next, we run the cluster-property tester on these regions, with distance parameter ε/4. Clearly, if the cluster-tester 
rejects, then we can reject as well. Thus, we may assume below that the total impurity within individual regions is 
at most ε/4. 

Now, consider the following weighted graph G_f. We have N vertices, one for each of the regions. We have 
an edge between regions R_i and R_j if diam(R_i ∪ R_j) < γ. We define the weight w(i, j) of this edge to be 
min(D[R_i], D[R_j]), where D[R] is the probability mass in R under distribution D. Notice that if there is no edge 
between regions R_i and R_j, then by the triangle inequality every point in R_i must be at distance at least γ' from 
every point in R_j. Also, note that each vertex has degree O(c^d) = O(1), so the total weight over all edges is O(1). 
Finally, note that while algorithmically we do not know the edge weights precisely, we can estimate all edge weights 
to within ±ε/(4M), where M = O(N) is the total number of edges, using the unlabeled sample size bounds given in the 
Theorem statement. Let ŵ(i, j) denote the estimated weight of edge (i, j). 
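
The two quantitative claims in this paragraph (the distance guarantee for non-adjacent regions and the bound on the total edge weight) can be written out as follows. This is a worked restatement under the proof's own definitions; the symbols u, v, x, y, Δ and E are illustrative names introduced here, and the first part assumes for simplicity that the diameter of R_i ∪ R_j is attained.

```latex
% (i) No edge between R_i and R_j means diam(R_i \cup R_j) \ge \gamma. Each region
% has diameter at most \gamma/(2c) < \gamma, so there are u \in R_i and v \in R_j
% with \|u - v\| \ge \gamma. For any x \in R_i and y \in R_j:
\[
  \|x - y\| \;\ge\; \|u - v\| - \|x - u\| - \|v - y\|
            \;\ge\; \gamma - \frac{\gamma}{2c} - \frac{\gamma}{2c}
            \;=\; \gamma\Bigl(1 - \frac{1}{c}\Bigr) \;=\; \gamma'.
\]
% (ii) With \Delta = O(c^d) the maximum degree and E the edge set of the graph:
\[
  \sum_{(i,j) \in E} \min\bigl(D[R_i], D[R_j]\bigr)
  \;\le\; \sum_{(i,j) \in E} \frac{D[R_i] + D[R_j]}{2}
  \;=\; \frac{1}{2} \sum_{i=1}^{N} \deg(i)\, D[R_i]
  \;\le\; \frac{\Delta}{2} \;=\; O(1).
\]
```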

Let E_witness be the set of edges such that one endpoint is majority positive and one is majority negative. 
Note that if D_f satisfies the γ-margin property, then every edge in E_witness has weight 0. On the other hand, if D_f 
is ε-far from the γ'-margin property, then the total weight of edges in E_witness is at least 3ε/4. The reason is that 
otherwise one could convert D_f to D'_f satisfying the margin condition by zeroing out the probability mass in the 
lightest endpoint of every edge (i, j) ∈ E_witness, and then, for each vertex, zeroing out the probability mass of points 
in the minority label of that vertex. (Then, renormalize to have total probability 1.) The first step moves distance at 
most 3ε/4 and the second step moves distance at most ε/4 by our assumption of success of the cluster-tester. Finally, 
if the true total weight of edges in E_witness is at least 3ε/4, then the sum of their estimated weights ŵ(i, j) is at least 
ε/2. This implies we can perform our test as follows. For O(1/ε) steps, do: 

1. Choose an edge (i, j) with probability proportional to ŵ(i, j). 
2. Request the label for a random x ∈ R_i and a random y ∈ R_j. If the two labels disagree, then reject. 

If D_f is ε-far from the γ'-margin property, then each step has probability w(E_witness)/w(E) = Ω(ε) of choosing a 
witness edge, and conditioned on choosing a witness edge has probability at least 1/2 of detecting a violation. Thus, 
overall, we can test using O(1/ε) labeled examples and O(1/(γ^{2d} ε^2)) unlabeled examples. □ 
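
The structure of this tester (partition into small regions, run the cluster test, then sample edges by estimated weight) can be sketched as below. This is an illustration under several assumptions that are not in the proof: the regions are taken to be axis-aligned grid cubes of diameter γ/(2c), the estimated weights are empirical frequencies from the unlabeled sample, points are tuples of floats, and the function name `margin_tester`, the parameter `query_label` and the repetition constant `reps_const` are hypothetical.

```python
import itertools
import math
import random
from collections import defaultdict


def margin_tester(sample, query_label, gamma, c, eps, rng=None, reps_const=16):
    """Return False (reject) if a label disagreement is found, else True."""
    rng = rng or random.Random(0)
    d = len(sample[0])
    side = gamma / (2 * c * math.sqrt(d))   # cube diameter side*sqrt(d) = gamma/(2c)
    cell_of = lambda x: tuple(math.floor(v / side) for v in x)
    cells = defaultdict(list)               # nonempty regions -> sample points
    for x in sample:
        cells[cell_of(x)].append(x)
    rounds = int(reps_const / eps) + 1      # O(1/eps) rounds; constant is arbitrary

    # Phase 1: cluster-property tester with the regions as clusters.
    for _ in range(rounds):
        x = rng.choice(sample)
        y = rng.choice(cells[cell_of(x)])
        if query_label(x) != query_label(y):
            return False

    # Phase 2: edge (a, b) whenever diam(a U b) < gamma; weight = min empirical mass.
    reach = math.ceil(gamma / side)         # larger offsets cannot form an edge
    edges, weights = [], []
    for a in cells:
        for off in itertools.product(range(-reach, reach + 1), repeat=d):
            b = tuple(ai + oi for ai, oi in zip(a, off))
            if b <= a or b not in cells:    # visit each unordered pair once
                continue
            diam = side * math.sqrt(sum((abs(o) + 1) ** 2 for o in off))
            if diam < gamma:
                edges.append((a, b))
                weights.append(min(len(cells[a]), len(cells[b])) / len(sample))
    if not edges:
        return True
    for _ in range(rounds):
        a, b = rng.choices(edges, weights=weights, k=1)[0]
        if query_label(rng.choice(cells[a])) != query_label(rng.choice(cells[b])):
            return False
    return True
```

For two grid cubes whose indices differ by `off`, the diameter of their union is `side * sqrt(sum((|o| + 1)^2))`, which is what the edge test computes. A function with true margin γ is never rejected by this sketch, since every sampled pair (within a cube or across an edge) lies at distance less than γ.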

On the necessity of slack in testing the margin assumption: Consider an instance space X = [0, 1]^2 and two 
distributions over labeled examples D_1 and D_2. Distribution D_1 has probability mass 1/2^{n+1} on positive examples 
at location (0, i/2^n) and negative examples at (γ', i/2^n) for each i = 1, 2, ..., 2^n, for γ' = γ(1 − 1/2^{2n}). Notice 
that D_1 is 1/2-far from the γ-margin property because there is a matching between points in the support of D_1|_{f=1} 
and points in the support of D_1|_{f=0} where the matched points have distance less than γ. On the other hand, for each 
i = 1, 2, ..., 2^n, distribution D_2 has probability mass 1/2^n at either a positive point (0, i/2^n) or a negative point 
(γ', i/2^n), chosen at random, but zero probability mass at the other location. Distribution D_2 satisfies the γ-margin 
property, and yet D_1 and D_2 cannot be distinguished using a polynomial number of unlabeled examples. 
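
The geometry of this construction can be checked numerically for small n. The sketch below assumes the reading γ' = γ(1 − 1/2^{2n}) for the slack parameter and takes γ ≤ 1/√2, for which a positive and a negative point in different rows are at distance at least γ; the helper names `slack_instance` and `margin` are hypothetical, and the code only inspects the supports of the two distributions, not their sample-indistinguishability.

```python
import math
import random


def slack_instance(n, gamma, rng):
    """Supports of D_1 and D_2 as lists of ((x, y), label) pairs."""
    gp = gamma * (1 - 2 ** (-2 * n))                 # assumed value of gamma'
    rows = [i / 2 ** n for i in range(1, 2 ** n + 1)]
    # D_1: both the positive and the negative point are present in every row.
    d1 = [((0.0, r), 1) for r in rows] + [((gp, r), 0) for r in rows]
    # D_2: exactly one of the two points per row, chosen at random.
    d2 = [((0.0, r), 1) if rng.random() < 0.5 else ((gp, r), 0) for r in rows]
    return gp, d1, d2


def margin(support):
    """Smallest distance between a positive and a negative support point."""
    pos = [p for p, lab in support if lab == 1]
    neg = [p for p, lab in support if lab == 0]
    return min((math.dist(p, q) for p in pos for q in neg), default=math.inf)
```

Under these assumptions, `margin(d1)` equals γ' (strictly below γ), while `margin(d2)` is at least γ for every random choice of rows, which is the gap the slack argument relies on.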